Methods of determining tissues and/or cell types giving rise to cell-free DNA, and methods of identifying a disease or disorder using same

ABSTRACT

The present disclosure provides methods of determining one or more tissues and/or cell-types contributing to cell-free DNA (“cfDNA”) in a biological sample of a subject. In some embodiments, the present disclosure provides a method of identifying a disease or disorder in a subject as a function of one or more determined tissues and/or cell-types contributing to cfDNA in a biological sample from the subject.

PRIORITY CLAIM

This application claims priority to U.S. Provisional Application Nos. 62/029,178, filed Jul. 25, 2014, and 62/087,619, filed Dec. 4, 2014, the subject matter of each of which is hereby incorporated by reference as if fully set forth herein.

STATEMENT OF GOVERNMENT INTEREST

This invention was made with government support under Grant No. 1DP1HG007811 awarded by the National Institutes of Health (NIH). The government has certain rights in the invention.

TECHNICAL FIELD

The present disclosure relates to methods of determining one or more tissues and/or cell-types giving rise to cell-free DNA. In some embodiments, the present disclosure provides a method of identifying a disease or disorder in a subject as a function of one or more determined tissues and/or cell-types associated with cell-free DNA in a biological sample from the subject.

BACKGROUND

Cell-free DNA (“cfDNA”) is present in the circulating plasma, urine, and other bodily fluids of humans. The cfDNA comprises double-stranded DNA fragments that are relatively short (overwhelmingly less than 200 base-pairs) and are normally at a low concentration (e.g. 1-100 ng/mL in plasma). In the circulating plasma of healthy individuals, cfDNA is believed to primarily derive from apoptosis of blood cells (i.e., normal cells of the hematopoietic lineage). However, in specific situations, other tissues can contribute substantially to the composition of cfDNA in bodily fluids such as circulating plasma.

While cfDNA has been used in certain specialties (e.g., reproductive medicine, cancer diagnostics, and transplant medicine), existing tests based on cfDNA rely on differences in genotypes (e.g., primary sequence or copy number representation of a particular sequence) between two or more cell populations (e.g., maternal genome vs. fetal genome; normal genome vs. cancer genome; transplant recipient genome vs. donor genome, etc.). Unfortunately, because the overwhelming majority of cfDNA fragments found in any given biological sample derive from regions of the genome that are identical in sequence between the contributing cell populations, existing cfDNA-based tests are extremely limited in their scope of application. In addition, many diseases and disorders are accompanied by changes in the tissues and/or cell-types giving rise to cfDNA, for example from tissue damage or inflammatory processes associated with the disease or disorder. Existing cfDNA-based diagnostic tests relying on differences in primary sequence or copy number representation of particular sequences between two genomes cannot detect such changes. Thus, while the potential for cfDNA to provide powerful biopsy-free diagnostic methods is enormous, there still remains a need for cfDNA-based diagnostic methodologies that can be applied to diagnose a wide variety of diseases and disorders.

SUMMARY

The present disclosure provides methods of determining one or more tissues and/or cell-types giving rise to cell-free DNA (“cfDNA”) in a biological sample of a subject. In some embodiments, the present disclosure provides a method of identifying a disease or disorder in a subject as a function of one or more determined tissues and/or cell-types associated with cfDNA in a biological sample from the subject.

In some embodiments, the present disclosure provides a method of determining tissues and/or cell types giving rise to cell-free DNA (cfDNA) in a subject, the method comprising isolating cfDNA from a biological sample from the subject, the isolated cfDNA comprising a plurality of cfDNA fragments; determining a sequence associated with at least a portion of the plurality of cfDNA fragments; determining a genomic location within a reference genome for at least some cfDNA fragment endpoints of the plurality of cfDNA fragments as a function of the cfDNA fragment sequences; and determining at least some of the tissues and/or cell types giving rise to the cfDNA fragments as a function of the genomic locations of at least some of the cfDNA fragment endpoints.

In other embodiments, the present disclosure provides a method of identifying a disease or disorder in a subject, the method comprising isolating cell-free DNA (cfDNA) from a biological sample from the subject, the isolated cfDNA comprising a plurality of cfDNA fragments; determining a sequence associated with at least a portion of the plurality of cfDNA fragments; determining a genomic location within a reference genome for at least some cfDNA fragment endpoints of the plurality of cfDNA fragments as a function of the cfDNA fragment sequences; determining at least some of the tissues and/or cell types giving rise to the cfDNA as a function of the genomic locations of at least some of the cfDNA fragment endpoints; and identifying the disease or disorder as a function of the determined tissues and/or cell types giving rise to the cfDNA.

In other embodiments, the present disclosure provides a method for determining tissues and/or cell types giving rise to cell-free DNA (cfDNA) in a subject, the method comprising: (i) generating a nucleosome map by obtaining a biological sample from the subject, isolating the cfDNA from the biological sample, and measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; (ii) generating a reference set of nucleosome maps by obtaining a biological sample from control subjects or subjects with known disease, isolating the cfDNA from the biological sample, measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; and (iii) determining tissues and/or cell types giving rise to the cfDNA from the biological sample by comparing the nucleosome map derived from the cfDNA from the biological sample to the reference set of nucleosome maps; wherein (a), (b) and (c) are: (a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a cfDNA fragment; (b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a cfDNA fragment; and (c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a cfDNA fragment as a consequence of differential nucleosome occupancy.

In yet other embodiments, the present disclosure provides a method for determining tissues and/or cell types giving rise to cfDNA in a subject, the method comprising: (i) generating a nucleosome map by obtaining a biological sample from the subject, isolating the cfDNA from the biological sample, and measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; (ii) generating a reference set of nucleosome maps by obtaining a biological sample from control subjects or subjects with known disease, isolating the cfDNA from the biological sample, measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of DNA derived from fragmentation of chromatin with an enzyme such as micrococcal nuclease, DNase, or transposase; and (iii) determining tissues and/or cell types giving rise to the cfDNA from the biological sample by comparing the nucleosome map derived from the cfDNA from the biological sample to the reference set of nucleosome maps; wherein (a), (b) and (c) are: (a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a sequenced fragment; (b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a sequenced fragment; and (c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a sequenced fragment as a consequence of differential nucleosome occupancy.

In other embodiments, the present disclosure provides a method for diagnosing a clinical condition in a subject, the method comprising: (i) generating a nucleosome map by obtaining a biological sample from the subject, isolating cfDNA from the biological sample, and measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; (ii) generating a reference set of nucleosome maps by obtaining a biological sample from control subjects or subjects with known disease, isolating the cfDNA from the biological sample, measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; and (iii) determining the clinical condition by comparing the nucleosome map derived from the cfDNA from the biological sample to the reference set of nucleosome maps; wherein (a), (b) and (c) are: (a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a cfDNA fragment; (b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a cfDNA fragment; and (c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a cfDNA fragment as a consequence of differential nucleosome occupancy.

In other embodiments, the present disclosure provides a method for diagnosing a clinical condition in a subject, the method comprising: (i) generating a nucleosome map by obtaining a biological sample from the subject, isolating cfDNA from the biological sample, and measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; (ii) generating a reference set of nucleosome maps by obtaining a biological sample from control subjects or subjects with known disease, isolating the cfDNA from the biological sample, measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of DNA derived from fragmentation of chromatin with an enzyme such as micrococcal nuclease (MNase), DNase, or transposase; and (iii) determining the tissue-of-origin composition of the cfDNA from the biological sample by comparing the nucleosome map derived from the cfDNA from the biological sample to the reference set of nucleosome maps; wherein (a), (b) and (c) are: (a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a sequenced fragment; (b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a sequenced fragment; and (c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a sequenced fragment as a consequence of differential nucleosome occupancy.

These and other embodiments are described in greater detail below.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows three types of information that relate cfDNA fragmentation patterns to nucleosome occupancy, exemplified for a small genomic region. These same types of information might also arise through fragmentation of chromatin with an enzyme such as micrococcal nuclease (MNase), DNase, or transposase. FIG. 1A shows the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a sequenced fragment (i.e. points of fragmentation); FIG. 1B shows the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a sequenced fragment (i.e. consecutive pairs of fragmentation points that give rise to an individual molecule); and FIG. 1C shows the distribution of likelihoods that any specific base-pair in a human genome will appear within a sequenced fragment (i.e. relative coverage) as a consequence of differential nucleosome occupancy.
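By way of non-limiting illustration, the three types of information shown in FIG. 1 can be tallied from a list of aligned fragments roughly as sketched below. The function names, the representation of each fragment as a (start, end) pair, and the 0-based inclusive coordinate convention are assumptions made for this sketch and are not taken from the disclosure.

```python
from collections import Counter

def fragmentation_counts(fragments):
    """Tally the three count tables from (start, end) fragments.

    Coordinates are assumed to be 0-based and inclusive on one chromosome.
    """
    endpoint_counts = Counter()  # (a) times each position is a fragment terminus
    pair_counts = Counter()      # (b) times each (start, end) pair is observed
    coverage = Counter()         # (c) times each position lies within a fragment
    for start, end in fragments:
        endpoint_counts[start] += 1
        endpoint_counts[end] += 1
        pair_counts[(start, end)] += 1
        for pos in range(start, end + 1):
            coverage[pos] += 1
    return endpoint_counts, pair_counts, coverage

def normalize(counts):
    """Convert raw counts to relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()} if total else {}

if __name__ == "__main__":
    ends, pairs, cov = fragmentation_counts([(10, 12), (10, 14), (12, 14)])
    print(normalize(ends))
```

Normalizing the raw counts gives empirical frequencies; any further modeling needed to turn these into likelihoods is outside the scope of this sketch.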

FIG. 2 shows insert size distribution of a typical cfDNA sequencing library; here shown for the pooled cfDNA sample derived from human plasma containing contributions from an unknown number of healthy individuals (bulk.cfDNA).

FIG. 3A shows average periodogram intensities from Fast Fourier Transformation (FFT) of read start coordinates mapping to the first (chr1) human autosome across all cfDNA samples (Plasma), cfDNA from tumor patient samples (Tumor), cfDNA from pregnant female individuals (Pregnancy), MNase digestion of different human cell lines (Cell lines) and a human DNA shotgun sequencing library (Shotgun).

FIG. 3B shows average periodogram intensities from Fast Fourier Transformation (FFT) of read start coordinates mapping to the last (chr22) human autosome across all cfDNA samples (Plasma), cfDNA from tumor patient samples (Tumor), cfDNA from pregnant female individuals (Pregnancy), MNase digestion of different human cell lines (Cell lines) and a human DNA shotgun sequencing library (Shotgun).
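By way of non-limiting illustration, a periodogram intensity of read start coordinates at a single periodicity can be sketched as follows. For brevity, this sketch evaluates the discrete Fourier sum directly at one period rather than running a full FFT over all frequencies; the function names, the mean subtraction, and the normalization by signal length are assumptions of the sketch and not details taken from the disclosure.

```python
import cmath
import math

def read_start_signal(read_starts, region_start, region_length):
    """Count read starts per base within one genomic block."""
    signal = [0] * region_length
    for pos in read_starts:
        idx = pos - region_start
        if 0 <= idx < region_length:
            signal[idx] += 1
    return signal

def periodogram_intensity(signal, period_bp):
    """Squared magnitude of the Fourier sum at 1/period_bp, divided by length."""
    n = len(signal)
    mean = sum(signal) / n
    freq = 1.0 / period_bp
    acc = sum((x - mean) * cmath.exp(-2j * math.pi * freq * i)
              for i, x in enumerate(signal))
    return abs(acc) ** 2 / n

if __name__ == "__main__":
    # Toy block with one read start every 190 bp.
    starts = list(range(0, 1900, 190))
    sig = read_start_signal(starts, 0, 1900)
    for period in (150, 190, 196):
        print(period, periodogram_intensity(sig, period))
```

In the toy block, the intensity at the 190 bp period dominates the intensity at unrelated periods, which is the kind of signal the averaged periodograms summarize.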

FIG. 4 shows first three principal components (PC) of intensities at 196 base-pairs (bp) periodicity in 10 kilobase-pair (kbp) blocks across all autosomes: FIG. 4A shows PC 2 vs. PC 1; FIG. 4B shows PC 3 vs. PC 2.

FIG. 5 shows hierarchical clustering dendrogram of Euclidean distances of intensities measured at 196 bp periodicity in 10 kbp blocks across all autosomes.

FIG. 6 shows first three principal components of intensities at 181 bp to 202 bp periodicity in 10 kbp blocks across all autosomes: FIG. 6A shows PC 2 vs. PC 1; FIG. 6B shows PC 3 vs. PC 2.

FIG. 7 shows hierarchical clustering dendrogram of Euclidean distances of intensities measured at 181 bp to 202 bp periodicity in 10 kbp blocks across all autosomes.

FIG. 8 shows principal component analysis (first 7 of 10 PCs) of intensities at 181 bp to 202 bp periodicity in 10 kbp blocks across all autosomes for the cfDNA data sets: FIG. 8A shows PC 2 vs. PC 1; FIG. 8B shows PC 3 vs. PC 2; FIG. 8C shows PC 4 vs. PC 3; FIG. 8D shows PC 5 vs. PC 4; FIG. 8E shows PC 6 vs. PC 5; FIG. 8F shows PC 7 vs. PC 6.

FIG. 9 shows principal component analysis of intensities at 181 bp to 202 bp periodicity in 10 kbp blocks across all autosomes for the MNase data sets: FIG. 9A shows PC 2 vs. PC 1; FIG. 9B shows PC 3 vs. PC 2; FIG. 9C shows PC 4 vs. PC 3; FIG. 9D shows PC 5 vs. PC 4; FIG. 9E shows PC 6 vs. PC 5.

FIG. 10 shows average periodogram intensities for a representative human autosome (chr11) across all synthetic cfDNA and MNase data set mixtures.

FIG. 11 shows first two principal components of intensities at 181 bp to 202 bp periodicity in 10 kbp blocks across all autosomes for the synthetic MNase data set mixtures.

FIG. 12 shows first two principal components of intensities at 181 bp to 202 bp periodicity in 10 kbp blocks across all autosomes for the synthetic cfDNA data set mixtures.

FIG. 13 shows hierarchical clustering dendrogram of Euclidean distances of intensities at 181 bp to 202 bp periodicity in 10 kbp blocks across all autosomes for the synthetic MNase and cfDNA mixture data sets.

FIG. 14 shows read-start density in 1 kbp window around 23,666 CTCF binding sites for a set of samples with at least 100M reads.

FIG. 15 shows read-start density in 1 kbp window around 5,644 c-Jun binding sites for a set of samples with at least 100M reads.

FIG. 16 shows read-start density for 1 kbp window around 4,417 NF-YB binding sites for a set of samples with at least 100M reads.

FIG. 17 shows a schematic overview of the processes giving rise to cfDNA fragments. Apoptotic and/or necrotic cell death results in near-complete digestion of native chromatin. Protein-bound DNA fragments, typically associated with histones or transcription factors, preferentially survive digestion and are released into the circulation, while naked DNA is lost. Fragments can be recovered from peripheral blood plasma following proteinase treatment. In healthy individuals, cfDNA is primarily derived from myeloid and lymphoid cell lineages, but contributions from one or more additional tissues may be present in certain medical conditions.

FIG. 18 shows fragment length of cfDNA observed with conventional sequencing library preparation. Length is inferred from alignment of paired-end sequencing reads. A reproducible peak in fragment length at 167 base-pairs (bp) (green dashed line) is consistent with association with chromatosomes. Additional peaks evidence ~10.4 bp periodicity, corresponding to the helical pitch of DNA on the nucleosome core. Enzymatic end-repair during library preparation removes 5′ and 3′ overhangs and may obscure true cleavage sites.

FIG. 19 shows a dinucleotide composition of 167 bp fragments and flanking genomic sequence in conventional libraries. Observed dinucleotide frequencies in the BH01 library were compared to expected frequencies from simulated fragments (matching for endpoint biases resulting from both cleavage and adapter ligation preferences).

FIG. 20 shows a schematic of a single-stranded library preparation protocol for cfDNA fragments.

FIG. 21 shows fragment length of cfDNA observed with single-stranded sequencing library preparation. No enzymatic end-repair is performed on template molecules during library preparation. Short fragments of 50-120 bp are highly enriched compared to conventional libraries. While ~10.4 bp periodicity remains, its phase is shifted by ~3 bp.

FIG. 22 shows a dinucleotide composition of 167 bp fragments and flanking genomic sequence in single-stranded libraries. Observed dinucleotide frequencies in the IH02 library were compared to expected frequencies derived from simulated fragments, again matching for endpoint biases. The apparent difference in the background level of bias between BH01 and IH02 relates to differences between the simulations, rather than the real libraries (data not shown).

FIG. 23A shows a gel image of a representative cfDNA sequencing library prepared with the conventional protocol.

FIG. 23B shows a gel image of a representative cfDNA sequencing library prepared with the single-stranded protocol.

FIG. 24A shows mononucleotide cleavage biases of cfDNA fragments.

FIG. 24B shows dinucleotide cleavage biases of cfDNA fragments.

FIG. 25 shows a schematic overview of inference of nucleosome positioning. A per-base windowed protection score (WPS) is calculated by subtracting the number of fragment endpoints within a 120 bp window from the number of fragments completely spanning the window. High WPS values indicate increased protection of DNA from digestion; low values indicate that DNA is unprotected. Peak calls identify contiguous regions of elevated WPS.
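By way of non-limiting illustration, the per-base WPS calculation described for FIG. 25 can be sketched as follows. The function name, the 0-based inclusive (start, end) fragment representation, the placement of the window relative to the scored position, and the rule that a spanning fragment contributes no endpoints are assumptions made for this sketch.

```python
def windowed_protection_score(fragments, center, window=120):
    """WPS at one position: spanning fragments minus endpoints in the window.

    fragments: iterable of (start, end), 0-based inclusive (assumed convention).
    The window covers `window` positions starting at center - window // 2.
    """
    lo = center - window // 2
    hi = lo + window - 1
    spanning = 0
    endpoints_inside = 0
    for start, end in fragments:
        if start <= lo and end >= hi:
            # Fragment covers the whole window: evidence of protection.
            spanning += 1
        else:
            # Count each endpoint that falls inside the window.
            endpoints_inside += (lo <= start <= hi) + (lo <= end <= hi)
    return spanning - endpoints_inside

if __name__ == "__main__":
    frags = [(0, 166), (10, 176), (80, 246)]
    print([windowed_protection_score(frags, c) for c in (90, 200)])
```

A position inside a protected region scores positive because fragments span it, while a position near a preferred cleavage site scores negative because endpoints accumulate there.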

FIG. 26 shows strongly positioned nucleosomes at a well-studied alpha-satellite array. Coverage, fragment endpoints, and WPS values from sample CH01 are shown for long fragment (120 bp window; 120-180 bp reads) or short fragment (16 bp window; 35-80 bp reads) bins at a pericentromeric locus on chromosome 12. Nucleosome calls from CH01 (middle, blue boxes) are regularly spaced across the locus. Nucleosome calls based on MNase digestion from two published studies (middle, purple and black boxes) are also displayed. The locus overlaps with an annotated alpha-satellite array.

FIG. 27 shows inferred nucleosome positioning around a DNase I hypersensitive site (DHS) on chromosome 9. Coverage, fragment endpoints, and WPS values from sample CH01 are shown for long and short fragment bins. The hypersensitive region, highlighted in gray, is marked by reduced coverage in the long fragment bin. Nucleosome calls from CH01 (middle, blue boxes) adjacent to the DHS are spaced more widely than typical adjacent pairs, consistent with accessibility of the intervening sequence to regulatory proteins including transcription factors. Coverage of shorter fragments, which may be associated with such proteins, is increased at the DHS, which overlaps with several annotated transcription factor binding sites (not shown). Nucleosome calls based on MNase digestion from two published studies are shown as in FIG. 26.

FIG. 28 shows a schematic of peak calling and scoring according to one embodiment of the present disclosure.

FIG. 29 shows CH01 peak density by GC content.

FIG. 30 shows a histogram of distances between adjacent peaks by sample. Distances are measured from peak call to adjacent call.

FIG. 31 shows a comparison of peak calls between samples. For each pair of samples, the distances between each peak call in the sample with fewer peaks and the nearest peak call in the other sample are calculated and visualized as a histogram with bin size of 1. Negative numbers indicate the nearest peak is upstream; positive numbers indicate the nearest peak is downstream.
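By way of non-limiting illustration, the signed nearest-peak distances described for FIG. 31 can be sketched as follows. The function names, the representation of peak calls as integer coordinates on a single chromosome, the requirement that the reference list be non-empty, and the tie-breaking toward the upstream peak are assumptions made for this sketch.

```python
import bisect
from collections import Counter

def nearest_peak_offsets(query_peaks, reference_peaks):
    """Signed distance from each query peak to its nearest reference peak.

    Negative: nearest reference peak is upstream (lower coordinate).
    Positive: nearest reference peak is downstream.
    Assumes reference_peaks is non-empty.
    """
    ref = sorted(reference_peaks)
    offsets = []
    for q in query_peaks:
        i = bisect.bisect_left(ref, q)
        # Only the neighbors on either side of the insertion point can be nearest.
        candidates = ref[max(i - 1, 0):i + 1]
        nearest = min(candidates, key=lambda r: abs(r - q))
        offsets.append(nearest - q)
    return offsets

def compare_peak_calls(peaks_a, peaks_b):
    """Histogram (bin size 1) of offsets, querying from the smaller call set."""
    if len(peaks_a) <= len(peaks_b):
        query, reference = peaks_a, peaks_b
    else:
        query, reference = peaks_b, peaks_a
    return Counter(nearest_peak_offsets(query, reference))

if __name__ == "__main__":
    print(compare_peak_calls([100, 290, 480], [95, 300, 470, 660]))
```

Because offsets are integers, a Counter keyed by offset is equivalent to a histogram with bin size of 1.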

FIG. 32 shows a comparison of peak calls between samples: FIG. 32A shows IH01 vs. BH01; FIG. 32B shows IH02 vs. BH01; FIG. 32C shows IH02 vs. IH01.

FIG. 33A shows nucleosome scores for real vs. simulated peaks.

FIG. 33B shows median peak offset within a score bin as a function of the score bin (left y-axis), and the number of peaks in each score bin (right y-axis).

FIG. 34 shows a comparison of peak calls between samples and matched simulations: FIG. 34A shows BH01 simulation vs. BH01 actual; FIG. 34B shows IH01 simulation vs. IH01 actual; FIG. 34C shows IH02 simulation vs. IH01 actual.

FIG. 35 shows distances between adjacent peaks, sample CH01. The dotted black line indicates the mode of the distribution (185 bp).

FIG. 36 shows aggregate, adjusted windowed protection scores (WPS; 120 bp window) around 22,626 transcription start sites (TSS). TSS are aligned at the 0 position after adjusting for strand and direction of transcription. Aggregate WPS is tabulated for both real data and simulated data by summing per-TSS WPS at each position relative to the centered TSS. The values plotted represent the difference between the real and simulated aggregate WPS, further adjusted to local background as described in greater detail below. Higher WPS values indicate preferential protection from cleavage.

FIG. 37 shows aggregate, adjusted WPS around 22,626 start codons.

FIG. 38 shows aggregate, adjusted WPS around 224,910 splice donor sites.

FIG. 39 shows aggregate, adjusted WPS around 224,910 splice acceptor sites.

FIG. 40 shows aggregate, adjusted WPS around various genic features with data from CH01, including for real data, matched simulation, and their difference.

FIG. 41 shows nucleosome spacing in A/B compartments. Median nucleosome spacing in non-overlapping 100 kilobase (kb) bins, each containing ~500 nucleosome calls, is calculated genome-wide. A/B compartment predictions for GM12878, also with 100 kb resolution, are from published sources. Compartment A is associated with open chromatin and compartment B with closed chromatin.
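By way of non-limiting illustration, median nucleosome spacing in non-overlapping 100 kb bins can be sketched as follows. The function name, the use of integer peak-call coordinates on a single chromosome, and the choice to ignore spacings between peaks that fall in different bins are assumptions made for this sketch.

```python
from collections import defaultdict
from statistics import median

def median_spacing_by_bin(peak_positions, bin_size=100_000):
    """Median distance between adjacent peak calls within each bin.

    Returns {bin_index: median spacing in bp}. Pairs of adjacent peaks that
    straddle a bin boundary are skipped (an assumption of this sketch).
    """
    peaks = sorted(peak_positions)
    spacings = defaultdict(list)
    for left, right in zip(peaks, peaks[1:]):
        if left // bin_size == right // bin_size:
            spacings[left // bin_size].append(right - left)
    return {index: median(values) for index, values in spacings.items()}

if __name__ == "__main__":
    calls = [10, 195, 380, 570, 100_050, 100_250, 100_450]
    print(median_spacing_by_bin(calls))
```

The per-bin medians can then be compared against compartment labels at the same 100 kb resolution.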

FIG. 42 shows nucleosome spacing and A/B compartments on chromosomes 7 and 11. A/B segmentation (red and blue bars) largely recapitulates chromosomal G-banding (ideograms, gray bars). Median nucleosome spacing (black dots) is calculated in 100 kb bins and plotted above the A/B segmentation.

FIG. 43 shows aggregate, adjusted WPS for 93,550 CTCF sites for the long (top) and short (bottom) fractions.

FIG. 44 shows a zoomed-in view of the aggregate, adjusted WPS for short fraction cfDNA at CTCF sites. The light red bar (and corresponding shading within the plot) indicates the position of the known 52 bp CTCF binding motif. The dark red subsection of this bar indicates the location of the 17 bp motif used for the FIMO motif search.

FIG. 45 shows −1 to +1 nucleosome spacing calculated around CTCF sites derived from clustered FIMO predicted CTCF sites (purely motif-based: 518,632 sites), a subset of these predictions overlapping with ENCODE ChIP-seq peaks (93,530 sites), and a further subset that have been experimentally observed to be active across 19 cell lines (23,723 sites). The least stringent set of CTCF sites are predominantly separated by distances that are approximately the same as the genome-wide average (~190 bp). However, at the highest stringency, most CTCF sites are separated by a much wider distance (~260 bp), consistent with active CTCF binding and repositioning of adjacent nucleosomes.

FIGS. 46-48 show CTCF occupancy repositions flanking nucleosomes: FIG. 46 shows inter-peak distances for the three closest upstream and three closest downstream peak calls for 518,632 CTCF binding sites predicted by FIMO. FIG. 47 shows inter-peak distances for the three closest upstream and three closest downstream peak calls for 518,632 CTCF binding sites predicted by FIMO as in FIG. 46, but where the same set of CTCF sites has been filtered based on overlap with ENCODE ChIP-seq peaks, leaving 93,530 sites. FIG. 48 shows inter-peak distances for the three closest upstream and three closest downstream peak calls for 93,530 CTCF binding sites predicted by FIMO as in FIG. 47, but where the set of CTCF sites has been filtered based on overlap with the set of active CTCF sites experimentally observed across 19 cell lines, leaving 23,732 sites.

FIG. 49 shows, for the subset of putative CTCF sites with flanking nucleosomes spaced widely (230-270 bp), that both the long (top) and short (bottom) fractions exhibit a stronger signal of positioning with increasingly stringent subsets of CTCF sites. See FIG. 45 for the key defining the colored lines.

FIGS. 50-52 show CTCF occupancy repositions flanking nucleosomes: FIG. 50 shows mean short fraction WPS (top panel) and mean long fraction WPS (bottom panel) for the 518,632 sites, partitioned into distance bins denoting the number of base-pairs separating the flanking +1 and −1 nucleosome calls for each site. FIG. 51 shows mean short fraction WPS (top panel) and mean long fraction WPS (bottom panel) for the 518,632 sites of FIG. 50, but where the same set of CTCF sites has been filtered based on overlap with ENCODE ChIP-seq peaks. FIG. 52 shows mean short fraction WPS (top panel) and mean long fraction WPS (bottom panel) for the sites of FIG. 51, but where the same set of sites has been further filtered based on overlap with the set of active CTCF sites experimentally observed across 19 cell lines. The key defining the colored lines for FIG. 50 is the same as in FIG. 51 and FIG. 52.

FIGS. 53A-H show footprints of transcription factor binding sites from short and long cfDNA fragments. Clustered FIMO binding site predictions were intersected with ENCODE ChIP-seq data to obtain a confident set of transcription factor (TF) binding sites for a set of additional factors. Aggregate, adjusted WPS for regions flanking the resulting sets of TF binding sites is displayed for both the long and short fractions of cfDNA fragments. Higher WPS values indicate higher likelihood of nucleosome or TF occupancy, respectively. FIG. 53A: AP-2; FIG. 53B: E2F-2; FIG. 53C: EBOX-TF; FIG. 53D: IRF; FIG. 53E: MYC-MAX; FIG. 53F: PAX5-2; FIG. 53G: RUNX-AML; FIG. 53H: YY1.

FIG. 54 shows aggregate, adjusted WPS for transcription factor ETS (210,798 sites). WPS calculated from both long (top) and short (bottom) cfDNA fractions are shown. Signal consistent with TF protection at the binding site itself (short fraction) with organization of the surrounding nucleosomes (long fraction) is observed. Similar analyses for additional TFs are shown in FIGS. 53A-H.

FIG. 55 shows aggregate, adjusted WPS for transcription factor MAFK (32,159 sites). WPS calculated from both long (top) and short (bottom) cfDNA fractions are shown. Signal consistent with TF protection at the binding site itself (short fraction) with organization of the surrounding nucleosomes (long fraction) is observed. Similar analyses for additional TFs are shown in FIGS. 53A-H.

FIG. 56 shows the inference of mixtures of cell-types contributing to cell-free DNA based on DNase hypersensitivity (DHS) sites. The frequency distribution of peak-to-peak spacing of nucleosome calls at DHS sites from 116 diverse biological samples shows a bimodal distribution, with the second mode plausibly corresponding to widened nucleosome spacing at active DHS sites due to intervening transcription factor binding (~190 bp → 260 bp). DHS sites identified in lymphoid or myeloid samples have the largest proportions of DHS sites with widened nucleosome spacing, consistent with hematopoietic cell death as the dominant source of cfDNA in healthy individuals.

FIG. 57 shows how partitioning of adjusted WPS scores around transcriptional start sites (TSS) into five gene expression bins (quintiles) defined for NB-4 (an acute promyelocytic leukemia cell line) reveals differences in the spacing and placement of nucleosomes. Highly expressed genes show a strong phasing of nucleosomes within the transcript body. Upstream of the TSS, −1 nucleosomes are well-positioned across expression bins, but −2 and −3 nucleosomes are only well-positioned for medium to highly expressed genes.

FIG. 58 shows that, for medium to highly expressed genes, a short fragment peak is observed between the TSS and the −1 nucleosome, consistent with footprinting of the transcription preinitiation complex, or some component thereof, at transcriptionally active genes.

FIG. 59 shows that median nucleosome distance in the transcript body is negatively correlated with gene expression as measured for the NB-4 cell line (ρ=−0.17, n=19,677 genes). Genes with little-to-no expression show a median nucleosome distance of 193 bp, while for expressed genes, this ranges between 186-193 bp. This negative correlation is stronger when more nucleosome calls are used to determine a more precise median distance (e.g. requiring at least 60 nucleosomes, ρ=−0.50; n=12,344 genes).

FIG. 60 shows how, to deconvolve multiple contributions, fast Fourier transformation (FFT) was used to quantify the abundance of specific frequency contributions (intensities) in the long fragment WPS for the first 10 kb of gene bodies starting at each TSS. Shown are trajectories of correlation between RNA expression in 76 cell lines and primary tissues with these intensities at different frequencies. Marked with a bold black line is the NB-4 cell line. Correlations are strongest in magnitude for intensities in the 193-199 bp frequency range.

FIG. 61 shows the inference of cell-types contributing to cell-free DNA in healthy states and cancer. The top panel shows the ranks of correlation for 76 RNA expression datasets with average intensity in the 193-199 bp frequency range for various cfDNA libraries, categorized by type and listed from highest rank (top rows) to lowest rank (bottom rows). Correlation values and full cell line or tissue names are provided in Table 3. All of the strongest correlations for all three healthy samples (BH01, IH01 and IH02; first three columns) are with lymphoid and myeloid cell lines as well as bone marrow. In contrast, cfDNA samples obtained from stage IV cancer patients (IC15, IC17, IC20, IC35, IC37; last five columns) show top correlations with various cancer cell lines, e.g. IC17 (hepatocellular carcinoma, HCC) showing highest correlations with HepG2 (hepatocellular carcinoma cell line), and IC35 (breast ductal carcinoma, DC) with MCF7 (metastatic breast adenocarcinoma cell line). When comparing cell line/tissue ranks observed for the cancer samples to each of the three healthy samples and averaging the rank changes (bottom panel), maximum rank changes are more than 2× higher than those observed from comparing the three healthy samples with each other and averaging rank changes (‘Control’). For example, for IC15 (small cell lung carcinoma, SCLC) the rank of SCLC-21H (small cell lung carcinoma cell line) increased by an average of 31 positions, for IC20 (squamous cell lung carcinoma, SCC) SK-BR-3 (metastatic breast adenocarcinoma cell line) increased by an average rank of 21, and for IC37 (colorectal adenocarcinoma, AC) HepG2 increased by 24 ranks.

FIG. 62 shows quantitation of aneuploidy to select samples with high burden of circulating tumor DNA, based on coverage (FIG. 62A) or allele balance (FIG. 62B). FIG. 62A shows the sums of Z scores for each chromosome calculated based on observed vs. expected numbers of sequencing reads for each sample (black dots) compared to simulated samples that assume no aneuploidy (red dots). FIG. 62B shows the allele balance at each of 48,800 common SNPs, evaluated per chromosome, for a subset of samples that were selected for additional sequencing.

FIG. 63 shows a comparison of peak calls to published nucleosome call sets: FIG. 63A shows the distance between nucleosome peak calls across three published data sets (Gaffney et al. 2012, J. S. Pedersen et al. 2014, and A Schep et al. 2015) as well as the calls generated here, including the matched simulation of CA01. Previously published data sets do not show one defined mode at the canonical ~185 bp nucleosome distance, probably due to their sparse sampling or wide call ranges. In contrast, all the nucleosome calls from cfDNA show one well-defined mode. The matched simulated data set has a shorter mode (166 bp) and a wider distribution. Further, the higher the coverage of the cfDNA data set used to generate calls, the higher the proportion of calls represented by the mode of the distribution. FIG. 63B shows the number of nucleosomes for each of the same list of sets as FIG. 63A. The cfDNA nucleosome calls present the most comprehensive call set with nearly 13M nucleosome peak calls. FIG. 63C shows the distances between each peak call in the IH01 cfDNA sample and the nearest peak call from three previously published data sets. FIG. 63D shows the distances between each peak call in the IH02 cfDNA sample and the nearest peak call from three previously published data sets. FIG. 63E shows the distances between each peak call in the BH01 cfDNA sample and the nearest peak call from three previously published data sets. FIG. 63F shows the distances between each peak call in the CH01 cfDNA sample and the nearest peak call from three previously published data sets. FIG. 63G shows the distances between each peak call in the CA01 cfDNA sample and the nearest peak call from three previously published data sets. Negative numbers indicate the nearest peak is upstream; positive numbers indicate the nearest peak is downstream. With increased cfDNA coverage, a higher proportion of previously published calls are found in closer proximity to the determined nucleosome call. Highest concordance was found with calls generated by Gaffney et al., PLoS Genet., vol. 8, e1003036 (2012) and A Schep et al. (2015). FIG. 63H shows the distances between each peak call and the nearest peak call from three previously published data sets, but this time for the matched simulation of CA01. The closest real nucleosome positions tend to be away from the peaks called in the simulation for the Gaffney et al., PLoS Genet., vol. 8, e1003036 (2012) and JS Pedersen et al., Genome Research, vol. 24, pp. 454-466 (2014) calls. Calls generated by A Schep et al. (2015) seem to show some overlap with the simulated calls.

DETAILED DESCRIPTION

The present disclosure provides methods of determining one or more tissues and/or cell-types giving rise to cell-free DNA in a subject's biological sample. In some embodiments, the present disclosure provides a method of identifying a disease or disorder in a subject as a function of one or more determined tissues and/or cell-types associated with cfDNA in a biological sample from the subject.

The present disclosure is based on a prediction that cfDNA molecules originating from different cell types or tissues differ with respect to: (a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a cfDNA fragment (i.e. points of fragmentation); (b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a cfDNA fragment (i.e. consecutive pairs of fragmentation points that give rise to an individual cfDNA molecule); and (c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a cfDNA fragment (i.e. relative coverage) as a consequence of differential nucleosome occupancy. These are referred to below as distributions (a), (b) and (c), or collectively referred to as “nucleosome dependent cleavage probability maps”, “cleavage accessibility maps” or “nucleosome maps” (FIG. 1). Of note, nucleosome maps might also be measured through the sequencing of fragments derived from the fragmentation of chromatin with an enzyme such as micrococcal nuclease (MNase), DNase, or transposase, or equivalent procedures that preferentially fragment genomic DNA between or at the boundaries of nucleosomes or chromatosomes.

In healthy individuals, cfDNA overwhelmingly derives from apoptosis of blood cells, i.e. cells of the hematopoietic lineage. As these cells undergo programmed cell death, their genomic DNA is cleaved and released into circulation, where it continues to be degraded by nucleases. The length distribution of cfDNA oscillates with a period of approximately 10.5 base-pairs (bp), corresponding to the helical pitch of DNA coiled around the nucleosome, and has a marked peak around 167 bp, corresponding to the length of DNA associated with a linker-associated mononucleosome (FIG. 2). This evidence has led to the hypothesis that cfDNA's association with the nucleosome is what protects it from complete, rapid degradation in the circulation. An alternative possibility is that the length distribution arises simply from the pattern of DNA cleavage during apoptosis itself, which is influenced directly by nucleosome positioning. Regardless, the length distribution of cfDNA provides clear evidence that the fragmentation processes that give rise to cfDNA are influenced by nucleosome positioning.

In some embodiments, the present disclosure defines a nucleosome map as the measurement of distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of either cfDNA from a bodily fluid or DNA derived from the fragmentation of chromatin with an enzyme such as micrococcal nuclease (MNase), DNase, or transposase, or equivalent procedures that preferentially fragment genomic DNA between or at the boundaries of nucleosomes or chromatosomes. As described below, these distributions may be ‘transformed’ in order to aggregate or summarize the periodic signal of nucleosome positioning within various subsets of the genome, e.g. quantifying periodicity in contiguous windows or, alternatively, in discontiguous subsets of the genome defined by transcription factor binding sites, gene model features (e.g. transcription start sites or gene bodies), topologically associated domains, tissue expression data or other correlates of nucleosome positioning. Furthermore, these subsets might be defined by tissue-specific data. For example, one could aggregate or summarize signal in the vicinity of tissue-specific DNase I hypersensitive sites.

The present disclosure provides a dense, genome-wide map of in vivo nucleosome protection inferred from plasma-borne cfDNA fragments. The CH01 map, derived from cfDNA of healthy individuals, comprises nearly 13M uniformly spaced local maxima of nucleosome protection that span the vast majority of the mappable human reference genome. Although the number of peaks is essentially saturated in CH01, other metrics of quality continued to be a function of sequencing depth (FIGS. 33A-B). An additional genome-wide nucleosome map was therefore constructed, by identical methods, that is based on nearly all of the cfDNA sequencing that the inventors have performed to date, for this study and other work (‘CA01’, 14.5 billion (G) fragments; 700-fold coverage; 13.0M peaks). Although this map exhibits even more uniform spacing and more highly supported peak calls (FIGS. 33A-B, 63A-H), we caution that it is based on cfDNA from both healthy and non-healthy individuals (Tables 1, 5).

The dense, genome-wide map of nucleosome protection disclosed herein approaches saturation of the mappable portion of the human reference genome, with peak-to-peak spacing that is considerably more uniform and consistent with the expected nucleosome repeat length than previous efforts to generate human genome-wide maps of nucleosome positioning or protection (FIGS. 63A-H). In contrast with nearly all previous efforts, the fragments observed herein are generated by endogenous physiological processes, and are therefore less likely to be subject to the technical variation associated with in vitro micrococcal nuclease digestion. The cell types that give rise to cfDNA considered in this reference map are inevitably heterogeneous (e.g. a mixture of lymphoid and myeloid cell types in healthy individuals). Nonetheless, the map's relative completeness may facilitate a deeper understanding of the processes that dictate nucleosome positioning and spacing in human cells, as well as the interplay of nucleosomes with epigenetic regulation, transcriptional output, and nuclear architecture.

Methods of Determining the Source(s) of cfDNA in a Subject's Biological Sample

As discussed generally above, and as demonstrated more specifically in the Examples which follow, the present technology may be used to determine (e.g., predict) the tissue(s) and/or cell type(s) which contribute to the cfDNA in a subject's biological sample.

Accordingly, in some embodiments, the present disclosure provides a method of determining tissues and/or cell-types giving rise to cell-free DNA (cfDNA) in a subject, the method comprising isolating cfDNA from a biological sample from the subject, the isolated cfDNA comprising a plurality of cfDNA fragments; determining a sequence associated with at least a portion of the plurality of cfDNA fragments; determining a genomic location within a reference genome for at least some cfDNA fragment endpoints of the plurality of cfDNA fragments as a function of the cfDNA fragment sequences; and determining at least some of the tissues and/or cell types giving rise to the cfDNA fragments as a function of the genomic locations of at least some of the cfDNA fragment endpoints.

In some embodiments, the biological sample comprises, consists essentially of, or consists of whole blood, peripheral blood plasma, urine, or cerebral spinal fluid.

In some embodiments, the step of determining at least some of the tissues and/or cell-types giving rise to the cfDNA fragments comprises comparing the genomic locations of at least some of the cfDNA fragment endpoints, or mathematical transformations of their distribution, to one or more reference maps. As used herein, the term “reference map” refers to any type or form of data which can be correlated or compared to an attribute of the cfDNA in the subject's biological sample as a function of the coordinate within the genome to which cfDNA sequences are aligned (e.g., the reference genome). The reference map may be correlated or compared to an attribute of the cfDNA in the subject's biological sample by any suitable means. For example and without limitation, the correlation or comparison may be accomplished by analyzing frequencies of cfDNA endpoints, either directly or after performing a mathematical transformation on their distribution across windows within the reference genome, in the subject's biological sample in view of numerical values or any other states defined for equivalent coordinates of the reference genome by the reference map. In another non-limiting example, the correlation or comparison may be accomplished by analyzing the determined nucleosome spacing(s) based on the cfDNA of the subject's biological sample in view of the determined nucleosome spacing(s), or another property that correlates with nucleosome spacing(s), in the reference map.

The reference map(s) may be sourced or derived from any suitable data source including, for example, public databases of genomic information, published data, or data generated for a specific population of reference subjects which may each have a common attribute (e.g., disease status). In some embodiments, the reference map comprises a DNase I hypersensitivity dataset. In some embodiments, the reference map comprises an RNA expression dataset. In some embodiments, the reference map comprises a chromosome conformation map. In some embodiments, the reference map comprises a chromatin accessibility map. In some embodiments, the reference map comprises data that is generated from at least one tissue or cell-type that is associated with a disease or a disorder. In some embodiments, the reference map comprises positions of nucleosomes and/or chromatosomes in a tissue or cell type. In some embodiments, the reference map is generated by a procedure that includes digesting chromatin with an exogenous nuclease (e.g., micrococcal nuclease). In some embodiments, the reference map comprises chromatin accessibility data determined by a transposition-based method (e.g., ATAC-seq). In some embodiments, the reference map comprises data associated with positions of a DNA binding and/or DNA occupying protein for a tissue or cell type. In some embodiments, the DNA binding and/or DNA occupying protein is a transcription factor. In some embodiments, the positions are determined by a procedure that includes chromatin immunoprecipitation of a crosslinked DNA-protein complex. In some embodiments, the positions are determined by a procedure that includes treating DNA associated with the tissue or cell type with a nuclease (e.g., DNase-I). In some embodiments, the reference map is generated by sequencing of cfDNA fragments from a biological sample from one or more individuals with a known disease. In some embodiments, this biological sample from which the reference map is generated is collected from an animal to which human cells or tissues have been xenografted.

In some embodiments, the reference map comprises a biological feature corresponding to positions of a DNA binding or DNA occupying protein for a tissue or cell type. In some embodiments, the reference map comprises a biological feature corresponding to quantitative RNA expression of one or more genes. In some embodiments, the reference map comprises a biological feature corresponding to the presence or absence of one or more histone marks. In some embodiments, the reference map comprises a biological feature corresponding to hypersensitivity to nuclease cleavage.

The step of comparing the genomic locations of at least some of the cfDNA fragment endpoints to one or more reference maps may be accomplished in a variety of ways. In some embodiments, the cfDNA data generated from the biological sample (e.g., the genomic locations of the cfDNA fragments, their endpoints, the frequencies of their endpoints, and/or nucleosome spacing(s) inferred from their distribution) is compared to more than one reference map. In such embodiments, the tissues or cell-types associated with the reference maps which correlate most highly with the cfDNA data in the biological sample are deemed to be contributing. For example and without limitation, if the cfDNA data includes a list of likely cfDNA endpoints and their locations within the reference genome, the reference map(s) having the most similar list of cfDNA endpoints and their locations within the reference genome may be deemed to be contributing. As another non-limiting example, the reference map(s) having the highest correlation (or increased correlation, relative to cfDNA from a healthy subject) with a mathematical transformation of the distribution of cfDNA fragment endpoints from the biological sample may be deemed to be contributing. The tissue types and/or cell types which correspond to those reference maps deemed to be contributing are then considered as potential sources of the cfDNA isolated from the biological sample.
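By way of non-limiting illustration only, the comparison of cfDNA data to more than one reference map might be sketched as follows. The sketch assumes that the cfDNA data and each reference map have already been reduced to equal-length numeric vectors over the same genomic windows, and it uses the Pearson correlation as one possible measure of similarity; the function name `rank_reference_maps`, the map names, and all values are hypothetical and are not taken from the disclosure.

```python
import numpy as np

def rank_reference_maps(sample_values, reference_maps):
    """Rank reference maps by Pearson correlation with the sample vector.

    sample_values  -- per-window values derived from the cfDNA sample
    reference_maps -- dict mapping a (hypothetical) tissue or cell-type
                      name to a vector of the same length
    Returns a list of (name, correlation), highest correlation first.
    """
    sample = np.asarray(sample_values, dtype=float)
    scored = []
    for name, ref in reference_maps.items():
        r = np.corrcoef(sample, np.asarray(ref, dtype=float))[0, 1]
        scored.append((name, float(r)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy data: five genomic windows, three reference maps.
sample = [1, 2, 3, 4, 5]
references = {
    "map_a": [2, 4, 6, 8, 10],
    "map_b": [5, 3, 4, 1, 2],
    "map_c": [1, 3, 2, 5, 4],
}
ranked = rank_reference_maps(sample, references)
```

In this sketch, the reference maps at the top of the returned list would be the ones deemed to be contributing; whether the highest correlation or another criterion is appropriate depends on the reference map type and is an assumption here.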

In some embodiments, the step of determining at least some of the tissues and/or cell types giving rise to the cfDNA fragments comprises performing a mathematical transformation on a distribution of the genomic locations of at least some of the cfDNA fragment endpoints. One non-limiting example of a mathematical transformation suitable for use in connection with the present technology is a Fourier transformation, such as a fast Fourier transformation (“FFT”).
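As a hedged sketch of such a transformation, the following applies a fast Fourier transformation to a per-base signal within a genomic window and reports the period with the greatest spectral power. The function name `dominant_period`, the period bounds of 120 to 280 bp, and the synthetic 200 bp signal are assumptions chosen for illustration; they are not parameters stated in the disclosure.

```python
import numpy as np

def dominant_period(signal, min_period=120, max_period=280):
    """Return the period (in bp) with the highest FFT power in a range.

    signal -- per-base values across one window (e.g., endpoint counts)
    The period bounds are illustrative assumptions.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                          # remove the constant offset
    power = np.abs(np.fft.rfft(x)) ** 2       # power at each frequency bin
    freqs = np.fft.rfftfreq(x.size, d=1.0)    # cycles per base-pair
    with np.errstate(divide="ignore"):
        periods = 1.0 / freqs                 # zero frequency maps to inf
    in_range = (periods >= min_period) & (periods <= max_period)
    best = np.argmax(np.where(in_range, power, -np.inf))
    return periods[best]

# Synthetic window of 10,000 bp with a 200 bp periodic component.
positions = np.arange(10_000)
synthetic = 5 + np.cos(2 * np.pi * positions / 200)
period = dominant_period(synthetic)
```

The power within a chosen period range, rather than the single dominant period, could equally be summarized per window and compared to a reference map.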

In some embodiments, the method further comprises determining a score for each of at least some coordinates of the reference genome, wherein the score is determined as a function of at least the plurality of cfDNA fragment endpoints and their genomic locations, and wherein the step of determining at least some of the tissues and/or cell types giving rise to the observed cfDNA fragments comprises comparing the scores to one or more reference maps. The score may be any metric (e.g., a numerical ranking or probability) which may be used to assign relative or absolute values to a coordinate of the reference genome. For example, the score may consist of, or be related to a probability, such as a probability that the coordinate represents a location of a cfDNA fragment endpoint, or a probability that the coordinate represents a location of the genome that is preferentially protected from nuclease cleavage by nucleosome or protein binding. As another example, the score may relate to nucleosome spacing in particular regions of the genome, as determined by a mathematical transformation of the distribution of cfDNA fragment endpoints within that region. Such scores may be assigned to the coordinate by any suitable means including, for example, by counting absolute or relative events (e.g., the number of cfDNA fragment endpoints) associated with that particular coordinate, or performing a mathematical transformation on the values of such counts in the region or a genomic coordinate. In some embodiments, the score for a coordinate is related to the probability that the coordinate is a location of a cfDNA fragment endpoint. In other embodiments, the score for a coordinate is related to the probability that the coordinate represents a location of the genome that is preferentially protected from nuclease cleavage by nucleosome or protein binding. In some embodiments, the score is related to nucleosome spacing in the genomic region of the coordinate.
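One simple way of assigning such scores, offered only as a sketch, is to count the fragment endpoints observed at each coordinate of a region (absolute events) and to divide by the total number of endpoints in the region (relative events). The representation of each aligned fragment as a `(start, end)` pair of terminus coordinates, the function name `endpoint_scores`, and the example coordinates are assumptions made for illustration.

```python
import numpy as np

def endpoint_scores(fragments, region_start, region_end):
    """Count cfDNA fragment endpoints per coordinate within a region.

    fragments -- iterable of (start, end) terminus coordinates for
                 fragments aligned to one chromosome (assumed format)
    Returns (counts, freqs): absolute endpoint counts and the same
    counts normalized to sum to 1 over the region.
    """
    counts = np.zeros(region_end - region_start, dtype=int)
    for start, end in fragments:
        for pos in (start, end):              # both termini are endpoints
            if region_start <= pos < region_end:
                counts[pos - region_start] += 1
    total = counts.sum()
    freqs = counts / total if total else counts.astype(float)
    return counts, freqs

# Hypothetical fragments aligned within coordinates 100-299.
fragments = [(100, 266), (100, 270), (105, 266)]
counts, freqs = endpoint_scores(fragments, 100, 300)
```

The relative frequencies are only an empirical estimate of an endpoint probability; a protection or spacing score would require a further transformation of these counts.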

The tissue(s) and/or cell-type(s) referred to in the methods described herein may be any tissue or cell-type which gives rise to cfDNA. In some embodiments, the tissue or cell-type is a primary tissue from a subject having a disease or disorder. In some embodiments, the disease or disorder is selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, inflammatory bowel disease, systemic autoimmune disease, localized autoimmune disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.

In some embodiments, the tissue or cell type is a primary tissue from a healthy subject.

In some embodiments, the tissue or cell type is an immortalized cell line.

In some embodiments, the tissue or cell type is a biopsy from a tumor.

In some embodiments, the reference map is based on sequence data obtained from samples obtained from at least one reference subject. In some embodiments, this sequence data defines positions of cfDNA fragment endpoints within a reference genome (for example, where the reference map is generated by sequencing of cfDNA from subject(s) with known disease). In other embodiments, this sequence data on which the reference map is based may comprise any one or more of: a DNase I hypersensitive site dataset, an RNA expression dataset, a chromosome conformation map, a chromatin accessibility map, or a nucleosome positioning map generated by digestion of chromatin with micrococcal nuclease.

In some embodiments, the reference subject is healthy. In some embodiments, the reference subject has a disease or disorder, optionally selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, inflammatory bowel disease, systemic autoimmune disease, localized autoimmune disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.

In some embodiments, the reference map comprises scores for at least a portion of coordinates of the reference genome associated with the tissue or cell type. In some embodiments, the reference map comprises a mathematical transformation of the scores, such as a Fourier transformation of the scores. In some embodiments, the scores are based on annotations of reference genomic coordinates for the tissue or cell type. In some embodiments, the scores are based on positions of nucleosomes and/or chromatosomes. In some embodiments, the scores are based on transcription start sites and/or transcription end sites. In some embodiments, the scores are based on predicted binding sites of at least one transcription factor. In some embodiments, the scores are based on predicted nuclease hypersensitive sites. In some embodiments, the scores are based on predicted nucleosome spacing.

In some embodiments, the scores are associated with at least one orthogonal biological feature. In some embodiments, the orthogonal biological feature is associated with highly expressed genes. In some embodiments, the orthogonal biological feature is associated with lowly expressed genes.

In some embodiments, at least some of the plurality of the scores have a value above a threshold (minimum) value. In such embodiments, scores falling below the threshold (minimum) value are excluded from the step of comparing the scores to a reference map. In some embodiments, the threshold value is determined before determining the tissue(s) and/or the cell type(s) giving rise to the cfDNA. In other embodiments, the threshold value is determined after determining the tissue(s) and/or the cell type(s) giving rise to the cfDNA.
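The exclusion of scores below a threshold before comparison might look like the following minimal sketch. It assumes that scores and reference-map values are aligned vectors over the same coordinates and that the comparison is a Pearson correlation; the function name `correlate_above_threshold`, the threshold of 0.5, and the data are hypothetical.

```python
import numpy as np

def correlate_above_threshold(scores, reference, threshold):
    """Correlate scores with a reference map, using only coordinates
    whose score is at or above the threshold (minimum) value.

    Returns (correlation, number_of_coordinates_kept).
    """
    scores = np.asarray(scores, dtype=float)
    reference = np.asarray(reference, dtype=float)
    keep = scores >= threshold                # drop scores below threshold
    if keep.sum() < 2:
        raise ValueError("fewer than two scores meet the threshold")
    r = np.corrcoef(scores[keep], reference[keep])[0, 1]
    return float(r), int(keep.sum())

# Hypothetical scores and reference values at five coordinates.
scores = [0.1, 0.9, 0.8, 0.05, 0.7]
reference = [5, 9, 8, 7, 7]
r, kept = correlate_above_threshold(scores, reference, threshold=0.5)
```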

In some embodiments, the step of determining the tissues and/or cell types giving rise to the cfDNA as a function of a plurality of the genomic locations of at least some of the cfDNA fragment endpoints comprises comparing a mathematical transformation of the distribution of the genomic locations of at least some of the cfDNA fragment endpoints of the sample with one or more features of one or more reference maps. One non-limiting example of a mathematical transformation suitable for this purpose is a Fourier transformation, such as a fast Fourier transformation (“FFT”).

In any embodiment described herein, the method may further comprise generating a report comprising a list of the determined tissues and/or cell-types giving rise to the isolated cfDNA. The report may optionally further include any other information about the sample and/or the subject, the type of biological sample, the date the biological sample was obtained from the subject, the date the cfDNA isolation step was performed and/or tissue(s) and/or cell-type(s) which likely did not give rise to any cfDNA isolated from the biological sample.

In some embodiments, the report further includes a recommended treatment protocol including, for example and without limitation, a suggestion to obtain an additional diagnostic test from the subject, a suggestion to begin a therapeutic regimen, a suggestion to modify an existing therapeutic regimen with the subject, and/or a suggestion to suspend or stop an existing therapeutic regimen.

Methods of Identifying a Disease or Disorder in a Subject

As discussed generally above, and as demonstrated more specifically in the Examples which follow, the present technology may be used to determine (e.g., predict) a disease or disorder, or the absence of a disease or a disorder, based at least in part on the tissue(s) and/or cell type(s) which contribute to cfDNA in a subject's biological sample.

Accordingly, in some embodiments, the present disclosure provides a method of identifying a disease or disorder in a subject, the method comprising isolating cell free DNA (cfDNA) from a biological sample from the subject, the isolated cfDNA comprising a plurality of cfDNA fragments; determining a sequence associated with at least a portion of the plurality of cfDNA fragments; determining a genomic location within a reference genome for at least some cfDNA fragment endpoints of the plurality of cfDNA fragments as a function of the cfDNA fragment sequences; determining at least some of the tissues and/or cell types giving rise to the cfDNA as a function of the genomic locations of at least some of the cfDNA fragment endpoints; and identifying the disease or disorder as a function of the determined tissues and/or cell types giving rise to the cfDNA.

In some embodiments, the biological sample comprises, consists essentially of, or consists of whole blood, peripheral blood plasma, urine, or cerebral spinal fluid.

In some embodiments, the step of determining the tissues and/or cell-types giving rise to the cfDNA comprises comparing the genomic locations of at least some of the cfDNA fragment endpoints, or mathematical transformations of their distribution, to one or more reference maps. The term “reference map” as used in connection with these embodiments may have the same meaning described above with respect to methods of determining tissue(s) and/or cell type(s) giving rise to cfDNA in a subject's biological sample. In some embodiments, the reference map may comprise any one or more of: a DNase I hypersensitive site dataset, an RNA expression dataset, a chromosome conformation map, a chromatin accessibility map, sequence data that is generated from samples obtained from at least one reference subject, enzyme-mediated fragmentation data corresponding to at least one tissue that is associated with a disease or a disorder, and/or positions of nucleosomes and/or chromatosomes in a tissue or cell type. In some embodiments, the reference map is generated by sequencing of cfDNA fragments from a biological sample from one or more individuals with a known disease. In some embodiments, this biological sample from which the reference map is generated is collected from an animal to which human cells or tissues have been xenografted.

In some embodiments, the reference map is generated by digesting chromatin with an exogenous nuclease (e.g., micrococcal nuclease). In some embodiments, the reference maps comprise chromatin accessibility data determined by a transposition-based method (e.g., ATAC-seq). In some embodiments, the reference maps comprise data associated with positions of a DNA binding and/or DNA occupying protein for a tissue or cell type. In some embodiments, the DNA binding and/or DNA occupying protein is a transcription factor. In some embodiments, the positions are determined by chromatin immunoprecipitation of a crosslinked DNA-protein complex. In some embodiments, the positions are determined by treating DNA associated with the tissue or cell type with a nuclease (e.g., DNase-I).

In some embodiments, the reference map comprises a biological feature corresponding to positions of a DNA binding or DNA occupying protein for a tissue or cell type. In some embodiments, the reference map comprises a biological feature corresponding to quantitative expression of one or more genes. In some embodiments, the reference map comprises a biological feature corresponding to the presence or absence of one or more histone marks. In some embodiments, the reference map comprises a biological feature corresponding to hypersensitivity to nuclease cleavage.

In some embodiments, the step of determining the tissues and/or cell types giving rise to the cfDNA comprises performing a mathematical transformation on a distribution of the genomic locations of at least some of the plurality of the cfDNA fragment endpoints. In some embodiments, the mathematical transformation includes a Fourier transformation.

In some embodiments, the method further comprises determining a score for each of at least some coordinates of the reference genome, wherein the score is determined as a function of at least the plurality of cfDNA fragment endpoints and their genomic locations, and wherein the step of determining at least some of the tissues and/or cell types giving rise to the observed cfDNA fragments comprises comparing the scores to one or more reference maps. The score may be any metric (e.g., a numerical ranking or probability) which may be used to assign relative or absolute values to a coordinate of the reference genome. For example, the score may consist of, or be related to a probability, such as a probability that the coordinate represents a location of a cfDNA fragment endpoint, or a probability that the coordinate represents a location of the genome that is preferentially protected from nuclease cleavage by nucleosome or protein binding. As another example, the score may relate to nucleosome spacing in particular regions of the genome, as determined by a mathematical transformation of the distribution of cfDNA fragment endpoints within that region. Such scores may be assigned to the coordinate by any suitable means including, for example, by counting absolute or relative events (e.g., the number of cfDNA fragment endpoints) associated with that particular coordinate, or performing a mathematical transformation on the values of such counts in the region or a genomic coordinate. In some embodiments, the score for a coordinate is related to the probability that the coordinate is a location of a cfDNA fragment endpoint. In other embodiments, the score for a coordinate is related to the probability that the coordinate represents a location of the genome that is preferentially protected from nuclease cleavage by nucleosome or protein binding. In some embodiments, the score is related to nucleosome spacing in the genomic region of the coordinate.

The term “score” as used in connection with these embodiments may have the same meaning described above with respect to methods of determining tissue(s) and/or cell type(s) giving rise to cfDNA in a subject's biological sample. In some embodiments, the score for a coordinate is related to the probability that the coordinate is a location of a cfDNA fragment endpoint. In other embodiments, the score for a coordinate is related to the probability that the coordinate represents a location of the genome that is preferentially protected from nuclease cleavage by nucleosome or protein binding. In some embodiments, the score is related to nucleosome spacing in the genomic region of the coordinate.

In some embodiments, the tissue or cell-type used for generating a reference map is a primary tissue from a subject having a disease or disorder. In some embodiments, the disease or disorder is selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, systemic autoimmune disease, localized autoimmune disease, inflammatory bowel disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.

In some embodiments, the tissue or cell type is a primary tissue from a healthy subject.

In some embodiments, the tissue or cell type is an immortalized cell line.

In some embodiments, the tissue or cell type is a biopsy from a tumor.

In some embodiments, the reference map is based on sequence data obtained from samples obtained from at least one reference subject. In some embodiments, this sequence data defines positions of cfDNA fragment endpoints within a reference genome (for example, where the reference map is generated by sequencing of cfDNA from subject(s) with known disease). In other embodiments, this sequence data on which the reference map is based may comprise any one or more of: a DNase I hypersensitive site dataset, an RNA expression dataset, a chromosome conformation map, a chromatin accessibility map, or a nucleosome positioning map generated by digestion with micrococcal nuclease. In some embodiments, the reference subject is healthy. In some embodiments, the reference subject has a disease or disorder. In some embodiments, the disease or disorder is selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, systemic autoimmune disease, inflammatory bowel disease, localized autoimmune disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.

In some embodiments, the reference map comprises cfDNA fragment endpoint probabilities, or a quantity that correlates with such probabilities, for at least a portion of the reference genome associated with the tissue or cell type. In some embodiments, the reference map comprises a mathematical transformation of the cfDNA fragment endpoint probabilities, or a quantity that correlates with such probabilities.

In some embodiments, the reference map comprises scores for at least a portion of coordinates of the reference genome associated with the tissue or cell type. In some embodiments, the reference map comprises a mathematical transformation of the scores, such as a Fourier transformation of the scores. In some embodiments, the scores are based on annotations of reference genomic coordinates for the tissue or cell type. In some embodiments, the scores are based on positions of nucleosomes and/or chromatosomes. In some embodiments, the scores are based on transcription start sites and/or transcription end sites. In some embodiments, the scores are based on predicted binding sites of at least one transcription factor. In some embodiments, the scores are based on predicted nuclease hypersensitive sites.

In some embodiments, the scores are associated with at least one orthogonal biological feature. In some embodiments, the orthogonal biological feature is associated with highly expressed genes. In some embodiments, the orthogonal biological feature is associated with lowly expressed genes.

In some embodiments, at least some of the plurality of the scores each have a value above a threshold value. In such embodiments, scores falling below the threshold (minimum) value are excluded from the step of comparing the scores to a reference map. In some embodiments, the threshold value is determined before determining the tissue(s) and/or the cell type(s) giving rise to the cfDNA. In other embodiments, the threshold value is determined after determining the tissue(s) and/or the cell type(s) giving rise to the cfDNA.

In some embodiments, the step of determining the tissues and/or cell types giving rise to the cfDNA as a function of a plurality of the genomic locations of at least some of the cfDNA fragment endpoints comprises a mathematical transformation of the distribution of the genomic locations of at least some of the cfDNA fragment endpoints of the sample with one or more features of one or more reference maps.

In some embodiments, this mathematical transformation includes a Fourier transformation.

In some embodiments, the reference map comprises enzyme-mediated fragmentation data corresponding to at least one tissue that is associated with the disease or disorder.

In some embodiments, the reference genome is associated with a human.

In one aspect of the invention, the methods described herein are used for detection, monitoring and tissue(s) and/or cell-type(s)-of-origin assessment of malignancies from analysis of cfDNA in bodily fluids. It is now well documented that in patients with malignancies, a portion of cfDNA in bodily fluids such as circulating plasma can be derived from the tumor. The methods described here can potentially be used to detect and quantify this tumor-derived portion. Furthermore, as nucleosome occupancy maps are cell-type specific, the methods described here can potentially be used to determine the tissue(s) and/or cell-type(s)-of-origin of a malignancy. Also, as noted above, it has been observed that there is a major increase in the concentration of circulating plasma cfDNA in cancer, potentially disproportionate to the contribution from the tumor itself. This suggests that other tissues (e.g. stromal, immune system) may be contributing to circulating plasma cfDNA during cancer. To the extent that contributions from such other tissues to cfDNA are consistent between patients for a given type of cancer, the methods described above may enable cancer detection, monitoring, and/or tissue(s) and/or cell-type(s)-of-origin assignment based on signal from these other tissues rather than the cancer cells per se.

In another aspect of the invention, the methods described herein are used for detection, monitoring and tissue(s) and/or cell-type(s)-of-origin assessment of tissue damage from analysis of cfDNA in bodily fluids. It is to be expected that many pathological processes will result in a portion of cfDNA in bodily fluids such as circulating plasma deriving from damaged tissues. The methods described here can potentially be used to detect and quantify cfDNA derived from tissue damage, including identifying the relevant tissues and/or cell-types of origin. This may enable diagnosis and/or monitoring of pathological processes including myocardial infarction (acute damage of heart tissue), autoimmune disease (chronic damage of diverse tissues), and many others involving either acute or chronic tissue damage.

In another aspect of the invention, the methods described herein are used for estimating the fetal fraction of cfDNA in pregnancy and/or enhancing detection of chromosomal or other genetic abnormalities. Relatively shallow sequencing of the maternal plasma-borne DNA fragments, coupled with nucleosome maps described above, may allow a cost-effective and rapid estimation of fetal fraction in both male and female fetus pregnancies. Furthermore, by enabling non-uniform probabilities to be assigned to individual sequencing reads with respect to their likelihood of having originated from the maternal or fetal genome, these methods may also enhance the performance of tests directed at detecting chromosomal aberrations (e.g. trisomies) through analysis of cfDNA in maternal bodily fluids.

In another aspect of the invention, the methods described herein are used for quantifying the contribution of a transplant (autologous or allograft) to cfDNA. Current methods for early and noninvasive detection of acute allograft rejection involve sequencing plasma-borne DNA and identifying increased concentrations of fragments derived from the donor genome. This approach relies on relatively deep sequencing of this pool of fragments to detect, for example, 5-10% donor fractions. An approach based instead on nucleosome maps of the donated organ may enable similar estimates with shallower sequencing, or more sensitive estimates with an equivalent amount of sequencing. Analogous to cancer, it is also possible that cell types other than the transplant itself contribute to cfDNA composition during transplant rejection. To the extent that contributions from such other tissues to cfDNA are consistent between patients during transplant rejection, the methods described above may enable monitoring of transplant rejection based on signal from these other tissues rather than the transplant donor cells per se.

Additional Embodiments of the Present Disclosure.

The present disclosure also provides methods of diagnosing a disease or disorder using nucleosome reference map(s) generated from subjects having a known disease or disorder. In some such embodiments, the method comprises: (1) generating a reference set of nucleosome maps, wherein each nucleosome map is derived from cfDNA from bodily fluids of individual(s) with defined clinical conditions (e.g. normal, pregnancy, cancer type A, cancer type B, etc.) and/or DNA derived from digestion of chromatin of specific tissues and/or cell types; and (2) predicting the clinical condition and/or tissue/cell-type-of-origin composition of cfDNA from bodily fluids of individual(s) by comparing a nucleosome map derived from their cfDNA to the reference set of nucleosome maps.

STEP 1: Generating a reference set of nucleosome maps, and aggregating or summarizing signal from nucleosome positioning.

A preferred method for generating a nucleosome map includes DNA purification, library construction (by adaptor ligation and possibly PCR amplification) and massively parallel sequencing of cfDNA from a bodily fluid. An alternative source for nucleosome maps, which are useful in the context of this invention as reference points or for identifying principal components of variation, is DNA derived from digestion of chromatin with micrococcal nuclease (MNase), DNase treatment, ATAC-Seq or other related methods wherein information about nucleosome positioning is captured in distributions (a), (b) or (c). Descriptions of these distributions (a), (b) and (c) are provided above in [0078] and are shown graphically in FIG. 1.

In principle, very deep sequencing of such libraries can be used to quantify nucleosome occupancy in the aggregate cell types contributing to cfDNA at specific coordinates in the genome, but this is very expensive today. However, the signal associated with nucleosome occupancy patterns can be summarized or aggregated across continuous or discontinuous regions of the genome. For example, in Examples 1 and 2 provided herein, the distribution of sites in the reference human genome to which sequencing read start sites map, i.e. distribution (a), is subjected to Fourier transformation in 10 kilobase-pair (kbp) contiguous windows, followed by quantitation of intensities for frequency ranges that are associated with nucleosome occupancy. This effectively summarizes the extent to which nucleosomes exhibit structured positioning within each 10 kbp window. In Example 3 provided herein, we quantify the distribution of sites in the reference human genome to which sequencing read start sites map, i.e. distribution (a), in the immediate vicinity of transcription factor binding sites (TFBS) of specific transcription factors (TFs), which are often immediately flanked by nucleosomes when the TFBS is bound by the TF. This effectively summarizes nucleosome positioning as a consequence of TF activity in the cell type(s) contributing to cfDNA. Importantly, there are many related ways in which nucleosome occupancy signals can be meaningfully summarized. These include aggregating signal from distributions (a), (b), and/or (c) around other genomic landmarks such as DNase I hypersensitive sites, transcription start sites, topological domains, other epigenetic marks or subsets of all such sites defined by correlated behavior in other datasets (e.g. gene expression, etc.). As sequencing costs continue to fall, it will also be possible to directly use maps of nucleosome occupancy, including those generated from cfDNA samples associated with a known disease, as reference maps, i.e. without aggregating signal, for the purposes of comparison to an unknown cfDNA sample. In some embodiments, the biological sample from which the reference map of nucleosome occupancy is generated is collected from an animal to which human cells or tissues have been xenografted. The advantage of this is that sequenced cfDNA fragments mapping to the human genome will exclusively derive from the xenografted cells or tissues, as opposed to representing a mixture of cfDNA derived from the cells/tissues of interest along with hematopoietic lineages.

STEP 2: Predicting pathology(s), clinical condition(s) and/or tissue/cell-types-of-origin composition on the basis of comparing the cfDNA-derived nucleosome map of one or more new individuals/samples to the reference set of nucleosome maps either directly or after mathematical transformation of each map.

Once one has generated a reference set of nucleosome maps, there are a variety of statistical signal processing methods for comparing additional nucleosome map(s) to the reference set. In Examples 1 & 2, we first summarize long-range nucleosome ordering within 10 kbp windows along the genome in a diverse set of samples, and then perform principal components analysis (PCA) to cluster samples (Example 1) or to estimate mixture proportions (Example 2). Although we know the clinical condition of all cfDNA samples and the tissue/cell-type-of-origin of all cell line samples used in these Examples, any one of the samples could in principle have been the “unknown”, with its behavior in the PCA relative to all other nucleosome maps used to predict the presence/absence of a clinical condition or its tissue/cell-type-of-origin.

The unknown sample does not necessarily need to be precisely matched to one or more members of the reference set in a 1:1 manner. Rather, its similarities to each can be quantified (Example 1), or its nucleosome map can be modeled as a non-uniform mixture of two or more samples from the reference set (Example 2).

The tissue/cell-type-of-origin composition of cfDNA in each sample need not be predicted or ultimately known for the method of the present invention to be successful. Rather, the method described herein relies on the consistency of tissue/cell-type-of-origin composition of cfDNA in the context of a particular pathology or clinical condition. However, by surveying the nucleosome maps of a large number of tissues and/or cell types directly by analysis of DNA derived from digestion of chromatin and adding these to the nucleosome map, it would be possible to estimate the tissue(s) and/or cell-type(s) contributing to an unknown cfDNA-derived sample.

In any embodiment described herein, the method may further comprise generating a report comprising a statement identifying the disease or disorder. In some embodiments, the report may further comprise a list of the determined tissues and/or cell types giving rise to the isolated cfDNA. In some embodiments, the report further comprises a list of diseases and/or disorders which are unlikely to be associated with the subject. The report may optionally further include any other information about the sample and/or the subject, the type of biological sample, the date the biological sample was obtained from the subject, the date the cfDNA isolation step was performed and/or tissue(s) and/or cell type(s) which likely did not give rise to any cfDNA isolated from the biological sample.

In some embodiments, the report further includes a recommended treatment protocol including, for example and without limitation, a suggestion to obtain an additional diagnostic test from the subject, a suggestion to begin a therapeutic regimen, a suggestion to modify an existing therapeutic regimen with the subject, and/or a suggestion to suspend or stop an existing therapeutic regimen.

EXAMPLES

Example 1 Principal Components Analysis of Cell Free DNA Nucleosome Maps

The distribution of read start positions in sequencing data derived from cfDNA extractions and MNase digestion experiments was examined to assess the presence of signals related to nucleosome positioning. For this purpose, a pooled cfDNA sample (human plasma containing contributions from an unknown number of healthy individuals; bulk.cfDNA), a cfDNA sample from a single healthy male control individual (MC2.cfDNA), four cfDNA samples from patients with intracranial tumors (tumor.2349, tumor.2350, tumor.2351, tumor.2353), six MNase digestion experiments from five different human cell lines (Hap1.MNase, HeLa.MNase, HEK.MNase, NA12878.MNase, HeLaS3, MCF.7) and seven cfDNA samples from different pregnant female individuals (gm1matplas, gm2matplas, im1matplas, fgs002, fgs003, fgs004, fgs005) were analyzed and contrasted with a regular shotgun sequencing data set of DNA extracted from a female lymphoblastoid cell line (NA12878). A subset of the pooled cfDNA sample (26%, bulk.cfDNA_part) and of the single healthy male control individual (18%, MC2.cfDNA_part) were also included, as separate samples, to explore the effect of sequencing depth.

Read start coordinates were extracted and periodograms were created using Fast Fourier Transformation (FFT) as described in the Methods section. This analysis determines how much of the non-uniformity in the distribution of read start sites can be explained by signals of specific frequencies/periodicities. We focused on a range of 120-250 bp, which comprises the length of DNA wrapped around a single nucleosome (147 bp) as well as the additional nucleosome linker sequence (10-80 bp). FIG. 3 shows the average intensities for each frequency across all blocks of human chromosome 1 and human chromosome 22. It can be seen that MNase digestion experiments as well as cfDNA samples show clear peaks below 200 bp periodicity. Such a peak is not observed in the human shotgun data. These analyses are consistent with a major effect of nucleosome positioning on the distribution of fragment boundaries in cfDNA.
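
The periodogram computation described above can be illustrated with a minimal Python sketch. It is not the implementation used in these Examples: it evaluates a direct discrete Fourier sum at selected periods rather than calling an FFT routine, and the function names `periodogram_intensity` and `periodogram`, their parameters, and the normalization by window size are assumptions introduced purely for illustration.

```python
import cmath
import math


def periodogram_intensity(read_starts, window_start, window_size, period):
    """Intensity of one periodicity (in bp) in the read-start counts of a window.

    Hypothetical helper: builds a per-base count of read starts, removes the
    mean, and evaluates the squared Fourier magnitude at frequency 1/period.
    """
    counts = [0] * window_size
    for pos in read_starts:
        offset = pos - window_start
        if 0 <= offset < window_size:
            counts[offset] += 1
    mean = sum(counts) / window_size
    freq = 1.0 / period  # cycles per base pair
    acc = 0j
    for i, c in enumerate(counts):
        acc += (c - mean) * cmath.exp(-2j * math.pi * freq * i)
    return abs(acc) ** 2 / window_size


def periodogram(read_starts, window_start=0, window_size=10_000,
                periods=range(120, 251)):
    """Map each period (bp) to its intensity within one 10 kbp window."""
    return {
        p: periodogram_intensity(read_starts, window_start, window_size, p)
        for p in periods
    }
```

In such a sketch, read starts spaced at a regular interval within a window would be expected to produce the largest intensity at that interval, which is the behavior the averaged periodograms of FIG. 3 summarize across many windows.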

Variation in the exact peak frequency between samples was also observed. This is possibly a consequence of different distributions of linker sequence lengths in each cell type. That the peak derives from patterns of nucleosome-bound DNA plus linker sequence is supported by the observations that the flanks around the peaks are not symmetrical and that the intensities for frequencies higher than the peak are lower than those for frequencies lower than the peak. This suggests that plots similar to those presented in FIG. 3 can be used to perform quality control of cfDNA and MNase sequencing data. Random fragmentation or contamination of cfDNA and MNase with regular (shotgun) DNA will cause dilution or, in extreme cases, total absence of these characteristic intensity patterns in periodograms.

In the following, data were analyzed based on measured intensities at a periodicity of 196 bp as well as all intensities determined for the frequency range of 181 bp to 202 bp. A wider frequency range was chosen in order to provide higher resolution because a wider range of linker lengths is captured. These intensities were chosen as the focus purely for computational reasons here; different frequency ranges may be used in related embodiments. FIGS. 4 and 5 explore visualizations of the periodogram intensities at 196 bp across contiguous, non-overlapping 10 kbp blocks tiling the full length of human autosomes (see Methods for details). FIG. 4 presents a Principal Component Analysis (PCA) of the data and the projections across the first three components. Principal component 1 (PC1) (28.1% of variance) captures the differences in intensity strength seen in FIG. 3 and thereby separates MNase and cfDNA samples from genomic shotgun data. In contrast, PC2 (9.7% of variance) captures the differences between MNase and cfDNA samples. PC3 (6.4% of variance) captures differences between individual samples. FIG. 5 shows the hierarchical clustering dendrogram of this data based on Euclidean distances of the intensity vectors. We note that the two HeLa S3 experiments tightly cluster in the PCA and dendrogram, even though the data were generated in different labs and following different experimental protocols. “Normal” cfDNA samples, tumor cfDNA samples and groups of cell line MNase samples also clustered. Specifically, the three tumor samples originating from the same tumor type (glioblastoma multiforme) appear to cluster, separately from the tumor.2351 sample, which originates from a different tumor type (see Table 1). The GM1 and IM1 samples cluster separately from the other cfDNA samples obtained from pregnant women. This coincides with higher intensities observed for frequencies below the peak in these samples (i.e., a more pronounced left shoulder in FIG. 3). This might indicate subtle differences in the preparation of the cfDNA between the two sets of samples, or biological differences which were not controlled for (e.g., gestational age).
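
The clustering of samples by Euclidean distance between intensity vectors can be sketched as follows. This is a minimal illustration only: the linkage rule used for FIG. 5 is not stated in this section, so single linkage is assumed here, and the function names `euclidean` and `single_linkage` as well as the input layout (a dictionary from sample name to a list of per-window intensities) are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length intensity vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def single_linkage(vectors):
    """Agglomerative clustering with single linkage (an assumed linkage rule).

    vectors: dict mapping sample name -> list of per-window intensities.
    Returns the merges in order as (cluster_a, cluster_b, distance), where
    each cluster is a tuple of sample names.
    """
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    euclidean(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

The order and distances of the returned merges correspond to the branch structure of a dendrogram; samples that merge early at small distances are those that would appear tightly clustered.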

FIGS. 6 and 7 show the results of equivalent analyses but based on the frequency range of 181 bp to 202 bp. Comparing these plots, the results are largely stable to a wider frequency range; however, additional frequencies may improve sensitivity in more fine-scaled analyses. To further explore cell-type origin specific patterns, the cfDNA and MNase data sets were analyzed separately using PCA of intensities for this frequency range. In the following set of analyses, the five cfDNA samples from pregnant women, which show the pronounced left shoulder in FIG. 3, were excluded. FIG. 8 shows the first 7 principal components of the cfDNA data and FIG. 9 all six principal components for the six MNase data sets. While there is a clustering of related samples, there is also considerable variation (biological and technical variation) to separate each sample from the rest. For example, an effect of sequencing depth was observable, as can be seen from the separation of bulk.cfDNA and bulk.cfDNA_part as well as MC2.cfDNA and MC2.cfDNA_part. Read sampling may be used to correct for this technical confounder.

Some key observations of this example include:

1) Read start coordinates in cfDNA sequencing data capture a strong signal of nucleosome positioning.

2) Differences in the signal of nucleosome positioning, aggregated across subsets of the genome such as contiguous 10 kbp windows, correlate with sample origin.

Example 2 Mixture Proportion Estimation of Nucleosome Maps

In Example 1, basic clustering of samples that were generated or downloaded from public databases was studied. The analyses showed that read start coordinates in these data sets capture a strong signal of nucleosome positioning (across a range of sequencing depths from 20 million sequences to more than 1,000 million sequences) and that sample origin correlates with this signal. For the goals of this method, it would also be useful to be able to identify mixtures of known cell types and to some extent quantify the contributions of each cell type from this signal. For this purpose, this example explored synthetic mixtures (i.e., based on sequence reads) of two samples. We mixed sequencing reads in ratios of 5:95, 10:90, 15:85, 20:80, 30:70, 40:60, 50:50, 60:40, 70:30, 80:20, 90:10 and 95:5 for two MNase data sets (MCF.7 and NA12878.MNase) and two cfDNA data sets (tumor.2349 and bulk.cfDNA). The synthetic MNase mixture datasets were drawn from two sets of 196.9 million aligned reads (each from one of the original samples) and the synthetic cfDNA mixture datasets were drawn from two sets of 181.1 million aligned reads (each from one of the original samples).
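
A synthetic mixture of this kind can be sketched by sampling reads from two sources in a chosen ratio. The sketch below is an illustration under stated assumptions, not the procedure actually used: the function name `mix_reads`, the representation of reads as list elements, sampling without replacement, and the fixed random seed are all hypothetical choices.

```python
import random


def mix_reads(reads_a, reads_b, fraction_a, total, seed=0):
    """Draw a synthetic mixture of `total` reads from two read sets.

    fraction_a is the desired share of reads from sample A (e.g. 0.05 for a
    5:95 mixture). Reads are sampled without replacement from each source,
    so each source must contain enough reads for its share.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = round(total * fraction_a)
    n_b = total - n_a
    return rng.sample(reads_a, n_a) + rng.sample(reads_b, n_b)
```

For example, `mix_reads(reads_a, reads_b, 0.30, 500)` would return 150 reads drawn from the first set and 350 from the second.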

FIG. 10 shows the average intensities for chromosome 11, equivalent to FIG. 3 but for these synthetic mixtures. It can be seen from FIG. 10 how the different sample contributions cause shifts in the global frequency intensity patterns. This signal can be exploited to infer the synthetic mixture proportions. FIG. 11 shows the first two principal components for the MNase data set mixtures and FIG. 12 shows the first two principal components for the cfDNA data set mixtures. In both cases, the first PC directly captures the composition of the mixed data set. It is therefore readily conceivable how mixture proportions for two and possibly more cell types could be estimated from transformation of the frequency intensity data given the appropriate reference sets and using, for example, regression models. FIG. 13 shows the dendrogram of both data sets, confirming the overall similarities of mixture samples deriving from similar sample proportions as well as the separation of the cfDNA and MNase samples.
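
One simple regression model of the kind mentioned above is a two-component least-squares fit. The sketch below assumes, as a simplification that is not established in this Example, that the summarized profile of a mixture is approximately a linear blend of the two reference profiles; the function name `estimate_mixture_fraction` and its arguments are hypothetical.

```python
def estimate_mixture_fraction(observed, reference_a, reference_b):
    """Least-squares estimate of w in: observed ~ w * A + (1 - w) * B.

    All three arguments are equal-length lists of summarized intensities.
    The closed-form solution is clipped to the interval [0, 1].
    Assumes linear mixing of the profiles, which is a simplification.
    """
    numerator = 0.0
    denominator = 0.0
    for o, a, b in zip(observed, reference_a, reference_b):
        diff = a - b
        numerator += (o - b) * diff
        denominator += diff * diff
    if denominator == 0.0:
        raise ValueError("reference profiles are identical; fraction is not identifiable")
    return min(1.0, max(0.0, numerator / denominator))
```

Extending this idea to more than two reference profiles would require a constrained multivariate regression, which is beyond this short sketch.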

One of the key observations of this example is that the mixture proportions of various sample types (cfDNA or cell/tissue types) in an unknown sample can be estimated by modeling of nucleosome occupancy patterns.

Example 3 Measuring Nucleosome Occupancy Relative to Transcription Factor Binding Sites with cfDNA Sequencing Data

While previous examples demonstrate that signals of nucleosome positioning can be obtained by partitioning the genome into contiguous, non-overlapping 10 kbp windows, orthogonal methods can also be used to generate cleavage accessibility maps and may be less prone to artifacts based on window size and boundaries. One such method, explored in some detail in this Example, is the inference of nucleosome positioning through observed periodicity of read-starts around transcription factor (TF) binding sites.

It is well established that local nucleosome positioning is influenced by nearby TF occupancy. The effect on local remodeling of chromatin, and thus on the stable positioning of nearby nucleosomes, is not uniform across the set of TFs; occupancy of a given TF may have local effects on nucleosome positioning that are preferentially 5′ or 3′ of the binding site and stretch for greater or lesser genomic distance in specific cell types. Furthermore, and importantly for the purposes of this disclosure, the set of TF binding sites occupied in vivo in a particular cell varies between tissues and cell types, such that if one were able to identify TF binding site occupancy maps for tissues or cell types of interest, and repeated this process for one or more TFs, one could identify components of the mixture of cell types and tissues contributing to a population of cfDNA by identifying enrichment or depletion of one or more cell type- or tissue-specific TF binding site occupancy profiles.

To demonstrate this idea, read-starts in the neighborhood of TF binding sites were used to visually confirm cleavage biases reflective of preferential local nucleosome positioning. ChIP-seq transcription factor (TF) peaks were obtained from the Encyclopedia of DNA Elements (“ENCODE”) project (National Human Genome Research Institute, National Institutes of Health, Bethesda, Md.). Because the genomic intervals of these peaks are broad (200 to 400 bp on average), the active binding sites within these intervals were discerned by informatically scanning the genome for respective binding motifs with a conservative p-value cutoff (1×10⁻⁵, see Methods for details). The intersection of these two independently derived sets of predicted TF binding sites was then carried forward into downstream analysis.

The number of read-starts at each position within 500 bp of each candidate TF binding site was calculated in samples with at least 100 million sequences. Within each sample, all read-starts were summed at each position, yielding a total of 1,014 to 1,019 positions per sample per TF, depending on the length of the TF recognition sequence.
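
The aggregation of read-starts around binding sites can be sketched as below. This is a simplified illustration, not the analysis code used here: each site is reduced to a single center coordinate on one chromosome, strand orientation and motif length are ignored (so the profile has 2 × flank + 1 positions rather than the 1,014 to 1,019 reported above), and the function name `aggregate_read_starts` is hypothetical.

```python
from collections import Counter


def aggregate_read_starts(read_starts, site_centers, flank=500):
    """Sum read-start counts at each offset around a set of binding sites.

    read_starts: iterable of read-start coordinates on one chromosome.
    site_centers: iterable of binding-site center coordinates (assumed to be
    on the same chromosome; strand is ignored in this sketch).
    Returns a list of length 2 * flank + 1, indexed from -flank to +flank.
    """
    counts = Counter(read_starts)  # coordinate -> number of reads starting there
    profile = [0] * (2 * flank + 1)
    for center in site_centers:
        for offset in range(-flank, flank + 1):
            profile[offset + flank] += counts.get(center + offset, 0)
    return profile
```

Periodic peaks in the resulting profile would reflect regularly positioned nucleosomes flanking the sites, as discussed for the figures that follow.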

FIG. 14 shows the distribution of read-starts around 24,666 CTCF binding sites in the human genome in a variety of different samples, centered around the binding site itself. CTCF is an insulator binding protein and plays a major role in transcriptional repression. Previous studies suggest that CTCF binding sites anchor local nucleosome positioning such that at least 20 nucleosomes are symmetrically and regularly spaced around a given binding site, with an approximate period of 185 bp. One striking feature common to nearly all of the samples in FIG. 14 is the clear periodicity of nucleosome positioning both upstream and downstream of the binding site, suggesting that the local and largely symmetrical effects of CTCF binding in vivo are recapitulated in a variety of cfDNA and MNase-digested samples. Intriguingly, the periodicity of the upstream and downstream peaks is not uniform across the set of samples; the MNase-digested samples display slightly wider spacing of the peaks relative to the binding site, suggesting the utility of not only the intensity of the peaks, but also their period.

FIG. 15 shows the distribution of read-starts around 5,644 c-Jun binding sites. While the familiar periodicity is again visually identifiable for several samples in this figure, the effect is not uniform. Of note, three of the MNase-digested samples (Hap1.MNase, HEK.MNase, and NA12878.MNase) have much flatter distributions, which may indicate that c-Jun binding sites are not heavily occupied in these cells, or that the effect of c-Jun binding on local chromatin remodeling is less pronounced in these cell types. Regardless of the underlying mechanism, the observation that bias in the local neighborhood of read-starts varies from TF to TF and between sample types reinforces the potential role for read start-based inference of nucleosome occupancy for correlating or deconvoluting tissue-of-origin composition in cfDNA samples.

FIG. 16 shows the distribution of read-starts around 4,417 NF-YB binding sites. The start site distributions in the neighborhood of these TF binding sites demonstrate a departure from symmetry: here, the downstream effects (to the right within each plot) appear to be stronger than the upstream effects, as evidenced by the slight upward trajectory in the cfDNA samples. Also of note is the difference between the MNase-digested samples and the cfDNA samples: the former show, on average, a flatter profile in which peaks are difficult to discern, whereas the latter have both more clearly discernible periodicity and more identifiable peaks.

Methods for Examples 1-3

Clinical and Control Samples

Whole blood was drawn from pregnant women fgs002, fgs003, fgs004, and fgs005 during routine third-trimester prenatal care and stored briefly in Vacutainer tubes containing EDTA (BD). Whole blood from pregnant women IM1, GM1, and GM2 was obtained at 18, 13, and 10 weeks gestation, respectively, and stored briefly in Vacutainer tubes containing EDTA (BD). Whole blood from glioma patients 2349, 2350, 2351, and 2353 was collected as part of brain surgical procedures and stored for less than three hours in Vacutainer tubes containing EDTA (BD). Whole blood from Male Control 2 (MC2), a healthy adult male, was collected in Vacutainer tubes containing EDTA (BD). Four to ten ml of blood was available for each individual. Plasma was separated from whole blood by centrifugation at 1,000×g for 10 minutes at 4° C., after which the supernatant was collected and centrifuged again at 2,000×g for 15 minutes at 4° C. Purified plasma was stored in 1 ml aliquots at −80° C. until use.

Bulk human plasma, containing contributions from an unknown number of healthy individuals, was obtained from STEMCELL Technologies (Vancouver, British Columbia, Canada) and stored in 2 ml aliquots at −80° C. until use.

Processing of Plasma Samples

Frozen plasma aliquots were thawed on the bench-top immediately before use. Circulating cfDNA was purified from 2 ml of each plasma sample with the QIAamp Circulating Nucleic Acids kit (Qiagen, Venlo, Netherlands) as per the manufacturer's protocol. DNA was quantified with a Qubit fluorometer (Invitrogen, Carlsbad, Calif.) and a custom qPCR assay targeting a human Alu sequence.

MNase Digestions

Approximately 50 million cells of each line (GM12878, HeLa S3, HEK, Hap1) were grown using standard methods. Growth media was aspirated and cells were washed with PBS. Cells were trypsinized and neutralized with 2× volume of CSS media, then pelleted in conical tubes by centrifugation at 1,300 rpm for 5 minutes at 4° C. Cell pellets were resuspended in 12 ml ice-cold PBS with 1× protease inhibitor cocktail added, counted, and then pelleted by centrifugation at 1,300 rpm for 5 minutes at 4° C. Cell pellets were resuspended in RSB buffer (10 mM Tris-HCl, 10 mM NaCl, 3 mM MgCl₂, 0.5 mM spermidine, 0.02% NP-40, 1× protease inhibitor cocktail) to a concentration of 3 million cells per ml and incubated on ice for 10 minutes with gentle inversion. Nuclei were pelleted by centrifugation at 1,300 rpm for five minutes at 4° C. Pelleted nuclei were resuspended in NSB buffer (25% glycerol, 5 mM MgAc₂, 5 mM HEPES, 0.08 mM EDTA, 0.5 mM spermidine, 1 mM DTT, 1× protease inhibitor cocktail) to a final concentration of 15 million per ml. Nuclei were again pelleted by centrifugation at 1,300 rpm for 5 minutes at 4° C., and resuspended in MN buffer (500 mM Tris-HCl, 10 mM NaCl, 3 mM MgCl₂, 1 mM CaCl₂, 1× protease inhibitor cocktail) to a final concentration of 30 million per ml. Nuclei were split into 200 μl aliquots and digested with 4 U of micrococcal nuclease (Worthington Biochemical Corp., Lakewood, N.J., USA) for five minutes at 37° C. The reaction was quenched on ice with the addition of 85 μl of MNSTOP buffer (500 mM NaCl, 50 mM EDTA, 0.07% NP-40, 1× protease inhibitor), followed by a 90 minute incubation at 4° C. with gentle inversion. DNA was purified using phenol:chloroform:isoamyl alcohol extraction. Mononucleosomal fragments were size selected with 2% agarose gel electrophoresis using standard methods and quantified with a Nanodrop spectrophotometer (Thermo Fisher Scientific Inc., Waltham, Mass., USA).

Preparation of Sequencing Libraries

Barcoded sequencing libraries for all samples were prepared with the ThruPLEX-FD or ThruPLEX DNA-seq 48 D kits (Rubicon Genomics, Ann Arbor, Mich.), comprising a proprietary series of end-repair, ligation, and amplification reactions. Between 3.0 and 10.0 ng of DNA were used as input for all clinical sample libraries. Two bulk plasma cfDNA libraries were constructed with 30 ng of input to each library; each library was separately barcoded. Two libraries from MC2 were constructed with 2 ng of input to each library; each library was separately barcoded. Libraries for each of the MNase-digested cell lines were constructed with 20 ng of size-selected input DNA. Library amplification for all samples was monitored by real-time PCR to avoid over-amplification.

Sequencing

All libraries were sequenced on HiSeq 2000 instruments (Illumina, Inc., San Diego, Calif., USA) using paired-end 101 bp reads with an index read of 9 bp. One lane of sequencing was performed for pooled samples fgs002, fgs003, fgs004, and fgs005, yielding a total of approximately 4.5×10⁷ read-pairs per sample. Samples IM1, GM1, and GM2 were sequenced across several lanes to generate 1.2×10⁹, 8.4×10⁸, and 7.6×10⁷ read-pairs, respectively. One lane of sequencing was performed for each of samples 2349, 2350, 2351, and 2353, yielding approximately 2.0×10⁸ read-pairs per sample. One lane of sequencing was performed for each of the four cell line MNase-digested libraries, yielding approximately 2.0×10⁸ read-pairs per library. Four lanes of sequencing were performed for one of the two replicate MC2 libraries and three lanes for one of the two replicate bulk plasma libraries, yielding a total of 10.6×10⁹ and 7.8×10⁸ read-pairs per library, respectively.

Processing of cfDNA Sequencing Data

DNA insert sizes for both cfDNA and MNase libraries tend to be short (majority of data between 80 bp and 240 bp); adapter sequence at the read ends of some molecules was therefore expected. Adapter sequences starting at read ends were trimmed, and the forward and reverse reads of paired-end (“PE”) data for short original molecules were collapsed into single reads (“SRs”); PE reads that overlapped by at least 11 bp were collapsed to SRs. SRs shorter than 30 bp or showing more than 5 bases with a quality score below 10 were discarded. The remaining PE and SR data were aligned to the human reference genome (GRCh37, 1000 G release v2) using fast alignment tools (BWA-ALN or BWA-MEM). The resulting SAM (Sequence Alignment/Map) format output was converted to sorted BAM (Binary Sequence Alignment/Map) format using SAMtools.
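The length and quality filter applied to collapsed single reads can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this Example: the function name `keep_single_read` and the representation of base qualities as a list of integer quality scores are hypothetical.

```python
def keep_single_read(sequence, qualities, min_length=30,
                     min_quality=10, max_low_quality=5):
    """Return True if a collapsed single read (SR) passes the filter.

    A read is discarded if it is shorter than min_length bases or has
    more than max_low_quality bases with a quality score below
    min_quality. Names and defaults are illustrative assumptions that
    mirror the thresholds stated in the text.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    if len(sequence) < min_length:
        return False
    low_quality_bases = sum(1 for q in qualities if q < min_quality)
    return low_quality_bases <= max_low_quality
```

For example, a 40 bp read with five low-quality bases is retained, while the same read with six low-quality bases is discarded.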

Additional Publicly Available Data

Publicly available PE data from HeLa S3 MNase (accessions SRR633612, SRR633613) and MCF-7 MNase experiments (accessions SRR999659-SRR999662) were downloaded and processed as described above.

Publicly available genomic shotgun sequencing data of the CEPH pedigree 146 individual NA12878 generated by Illumina Cambridge Ltd. (Essex, UK) was obtained from the European Nucleotide Archive (ENA, accessions ERR174324-ERR174329). This data was PE sequenced with 2×101 bp reads on the Illumina HiSeq platform, and the libraries were selected for longer insert sizes prior to sequencing. Thus, adapter sequence at the read ends was not expected; this data was therefore directly aligned using BWA-MEM.

Extracting Read End Information

PE data provides information about the two physical ends of DNA molecules used in sequencing library preparation. This information was extracted using the SAMtools application programming interface (API) from BAM files. Both outer alignment coordinates of PE data for which both reads aligned to the same chromosome and where reads have opposite orientations were used. For non-trimmed SR data, only one read end provides information about the physical end of the original DNA molecule. If a read was aligned to the plus strand of the reference genome, the left-most coordinate was used. If a read was aligned to the reverse strand, its right-most coordinate was used instead. In cases where PE data was converted to single read data by adapter trimming, both end coordinates were considered. Both end coordinates were also considered if at least five adapter bases were trimmed from an SR sequencing experiment.
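The rules above for choosing which alignment coordinates count as physical fragment ends can be sketched as a small function. This is an illustrative sketch only; the function name, the `kind` labels ("paired", "collapsed", "single") and the argument layout are assumptions, and in practice the coordinates would be read from BAM records.

```python
def fragment_ends(start, end, is_reverse, kind="single",
                  trimmed_adapter_bases=0):
    """Return the reference coordinates that mark physical fragment ends.

    start/end: left-most and right-most aligned coordinates of a single
    read, or the outer coordinates of a properly oriented read pair.
    kind: "paired", "collapsed" (PE data merged into one read by adapter
    trimming) or "single" (hypothetical labels for this sketch).
    """
    if kind not in ("paired", "collapsed", "single"):
        raise ValueError("kind must be 'paired', 'collapsed' or 'single'")
    # Both ends are known for PE data, for collapsed PE data, and for
    # single reads from which at least five adapter bases were trimmed.
    if kind in ("paired", "collapsed") or trimmed_adapter_bases >= 5:
        return [start, end]
    # A non-trimmed single read only informs about the end where
    # sequencing began: left-most on the plus strand, right-most on
    # the reverse strand.
    return [end] if is_reverse else [start]
```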

For all autosomes in the human reference sequence (chromosomes 1 to 22), the number of read ends and the coverage at all positions were extracted in windows of 10,000 bases (blocks). If there were no reads aligning in a block, the block was considered empty for that specific sample.
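A minimal sketch of the per-block tally for one chromosome is shown below. It assumes fragments are given as 0-based inclusive (start, end) pairs and, for brevity, tallies only fragment ends per block rather than per-position counts and coverage; the function name is hypothetical. Blocks that no fragment overlaps are simply absent from the result and can be treated as empty.

```python
from collections import defaultdict


def tally_blocks(fragments, block_size=10_000):
    """Tally fragment ends per block for one chromosome.

    fragments: iterable of (start, end), 0-based inclusive coordinates.
    Returns {block_index: number_of_fragment_ends}. A block overlapped
    by a fragment but containing no ends is reported with a count of 0;
    a block with no aligned fragment is absent (empty).
    """
    ends = defaultdict(int)
    for start, end in fragments:
        # Register every block the fragment overlaps so that covered
        # blocks are distinguishable from empty ones.
        for block in range(start // block_size, end // block_size + 1):
            ends[block] += 0
        ends[start // block_size] += 1
        ends[end // block_size] += 1
    return dict(ends)
```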

Smooth Periodograms

The ratio of read-starts to coverage was calculated for each non-empty block of each sample. If the coverage was 0, the ratio was set to 0. These ratios were used to calculate a periodogram of each block using Fast Fourier Transform (FFT, spec.pgram in the R statistical programming environment) with frequencies between 1/500 bases and 1/100 bases. Optionally, parameters to smooth (3 bp Daniell smoother; moving average giving half weight to the end values) and detrend the data (e.g., subtract the mean of the series and remove a linear trend) were used. Intensities for the frequency range 120-250 bp for each block were saved.
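The ratio and periodogram calculation can be illustrated with the following sketch. It is not a port of R's spec.pgram: it evaluates a direct discrete Fourier sum at each requested period instead of using an FFT, subtracts only the mean (no linear detrending), and applies no Daniell smoothing. Function names are assumptions made for illustration.

```python
import cmath
import math


def start_coverage_ratio(starts, coverage):
    """Per-position ratio of read-starts to coverage; 0 where coverage is 0."""
    return [s / c if c > 0 else 0.0 for s, c in zip(starts, coverage)]


def periodogram(values, periods):
    """Return {period: intensity} for a mean-subtracted series.

    Intensity is |sum_t x_t * exp(-2*pi*i*t/period)|^2 / n, evaluated
    directly at each requested period (in bases). This is a simplified
    stand-in for an FFT-based periodogram.
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    spectrum = {}
    for period in periods:
        total = sum(x * cmath.exp(-2j * math.pi * t / period)
                    for t, x in enumerate(centered))
        spectrum[period] = abs(total) ** 2 / n
    return spectrum
```

For a block whose ratios oscillate with a 200 bp period, the intensity over the 120-250 bp range is expected to peak at 200 bp.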

Average Chromosome Intensities

For a set of samples, blocks that were non-empty across all samples were identified. The intensities for a specific frequency were averaged across all blocks of each sample for each autosome.

Principal Component Analysis and Dendrograms

Blocks that were non-empty across samples were collected. Principal component analysis (PCA; prcomp in the R statistical programming environment) was used to reduce the dimensionality of the data and to plot it in two-dimensional space. PCA identifies the dimension that captures most variation of the data and constructs orthogonal dimensions, explaining decreasing amounts of variation in the data.

Pair-wise Euclidean distances between sample intensities were calculated and visualized as dendrograms (stats library in the R statistical programming environment).
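The pair-wise distance calculation that underlies such a dendrogram can be sketched as below. The sketch stops at the distance table (the hierarchical clustering and plotting steps are omitted), and the function name and input layout are assumptions.

```python
import math
from itertools import combinations


def pairwise_euclidean(intensities):
    """Return {(sample_a, sample_b): distance} for all sample pairs.

    intensities: {sample_name: [intensity per shared block, ...]};
    every vector must cover the same blocks in the same order.
    """
    distances = {}
    for a, b in combinations(sorted(intensities), 2):
        va, vb = intensities[a], intensities[b]
        if len(va) != len(vb):
            raise ValueError("intensity vectors must have equal length")
        distances[(a, b)] = math.sqrt(
            sum((x - y) ** 2 for x, y in zip(va, vb)))
    return distances
```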

Transcription Factor Binding Site Predictions

Putative transcription factor binding sites, obtained through analysis of ChIP-seq data generated across a number of cell types, were obtained from the ENCODE project.

An independent set of candidate transcription factor binding sites was obtained by scanning the human reference genome (GRCh37, 1000 G release v2) with the program fimo from the MEME software package (version 4.10.0_1). Scans were performed using positional weight matrices obtained from the JASPAR_CORE_2014_vertebrates database, using options “--verbosity 1 --thresh 1e-5”. Transcription factor motif identifiers used were MA0139.1, MA0502.1, and MA0489.1.

Chromosomal coordinates from both sets of predicted sites were intersected with bedtools v2.17.0. To preserve any asymmetry in the plots, only predicted binding sites on the “+” strand were used. Read-starts were tallied for each sample if they fell within 500 bp of either end of the predicted binding site, and summed within samples by position across all such sites. Only samples with at least 100 million total reads were used for this analysis.
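The tallying of read-starts around predicted binding sites can be sketched as follows. The sketch assumes read-start counts for one chromosome are held in a dictionary keyed by position and that each site is an inclusive (site_start, site_end) pair on the “+” strand; offsets are reported relative to the site start. All names are illustrative.

```python
from collections import Counter


def aggregate_read_starts(read_starts, sites, flank=500):
    """Sum read-starts by position relative to predicted binding sites.

    read_starts: {position: count} for one chromosome.
    sites: iterable of inclusive (site_start, site_end) pairs.
    Returns {offset: summed_count}, where offset = position - site_start
    and positions range from site_start - flank to site_end + flank.
    """
    profile = Counter()
    for site_start, site_end in sites:
        for pos in range(site_start - flank, site_end + flank + 1):
            count = read_starts.get(pos, 0)
            if count:
                profile[pos - site_start] += count
    return dict(profile)
```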

Example 4 Determining Normal/Healthy Tissue(s)-of-Origin from cfDNA

To evaluate whether fragmentation patterns observed in a single individual's cfDNA might contain evidence of the genomic organization of the cells giving rise to these fragments (and thus, of the tissue(s)-of-origin of the population of cfDNA molecules), even when there are no genotypic differences between contributing cell types, cfDNA was deeply sequenced to better understand the processes that give rise to it. The resulting data was used to build a genome-wide map of nucleosome occupancy that built on previous work by others, but is substantially more comprehensive. By optimizing library preparation protocols to recover short fragments, it was discovered that the in vivo occupancies of transcription factors (TFs) such as CTCF are also directly footprinted by cfDNA. Finally, it was discovered that nucleosome spacing in regulatory elements and gene bodies, as revealed by cfDNA sequencing in healthy individuals, correlates most strongly with DNase hypersensitivity and gene expression in lymphoid and myeloid cell lines.

cfDNA Fragments Correspond to Chromatosomes and Contain Substantial DNA Damage

Conventional sequencing libraries were prepared by end-repair and adaptor ligation to cfDNA fragments purified from plasma pooled from an unknown number of healthy individuals (“BH01”) or plasma from a single individual (“IH01”) (FIG. 17; Table 1):

TABLE 1. Sequencing Statistics for Plasma Samples.

| Sample name | Library type | Reads | Fragments sequenced | Aligned | Aligned Q30 | Coverage | Est. % duplicates | 35-80 bp | 120-180 bp |
|---|---|---|---|---|---|---|---|---|---|
| BH01 | DSP | 2x101 | 1489569204 | 97.20% | 88.85% | 96.32 | 6.00% | 0.65% | 57.64% |
| IH01 | DSP | 2x101 | 1572050374 | 98.58% | 90.60% | 104.92 | 21.00% | 0.77% | 47.83% |
| IH02 | SSP | 2x50, 43/42 | 779794090 | 93.19% | 75.27% | 30.08 | 20.05% | 21.83% | 44.00% |
| CH01 | - | - | 3841413668 | 96.95% | 86.81% | 231.32 | 14.99% | 5.00% | 50.85% |

SSP, single-stranded library preparation protocol. DSP, double-stranded library preparation protocol.

For each sample, sequencing-related statistics, including the total number of fragments sequenced, read lengths, the percentage of such fragments aligning to the reference with and without a mapping quality threshold, mean coverage, duplication rate, and the proportion of sequenced fragments in two length bins, were tabulated. Fragment length was inferred from alignment of paired-end reads. Due to the short read lengths, coverage was calculated by assuming the entire fragment had been read. The estimated number of duplicate fragments was based on fragment endpoints, which may overestimate the true duplication rate in the presence of highly stereotyped cleavage.

Libraries BH01 and IH01 were sequenced to 96- and 105-fold coverage, respectively (1.5 G and 1.6 G fragments). The fragment length distributions, inferred from alignment of paired-end reads, have a dominant peak at ~167 bp (coincident with the length of DNA associated with a chromatosome), and ~10.4 bp periodicity in the 100-160 bp length range (FIG. 18). These distributions are consistent with a model in which cfDNA fragments are preferentially protected from nuclease cleavage both pre- and post-cell death by association with proteins (in this case, by the nucleosome core particle and linker histone), but where some degree of additional nicking or cleavage occurs in relation to the helical pitch of nucleosome-bound DNA. Further supporting this model is the dinucleotide composition of these 167 bp fragments, which recapitulates key features of earlier studies of MNase-derived, nucleosome-associated fragments (e.g. bias against A/T dinucleotides at the dyad) and supports the notion that the nucleosome core particle is symmetrically positioned with respect to the chromatosome (FIG. 19).

A prediction of this model of cfDNA ontology is widespread DNA damage, e.g. single-strand nicks as well as 5′ and 3′ overhangs. During conventional library preparation, nicked strands are not amplified, overhangs are blunted by end-repair, and short double stranded DNA (“dsDNA”) molecules, which may represent a substantial proportion of total cfDNA, may simply be poorly recovered. To address this, a single-stranded sequencing library from plasma-borne cfDNA derived from an additional healthy individual (“IH02”) was prepared using a protocol adapted from studies of ancient DNA by Gansauge, et al., where widespread DNA damage and nuclease cleavage around nucleosomes have been reported. Briefly, cfDNA was denatured and a biotin-conjugated, single-stranded adaptor was ligated to the resulting fragments. The ligated fragments were then subjected to second-strand synthesis, end-repair and ligation of a second adaptor while the fragments were immobilized to streptavidin beads. Finally, minimal PCR amplification was performed to enrich for adaptor-bearing molecules while also appending a sample index (FIG. 20; Table 2).

TABLE 2. Synthetic oligos used in preparation of single stranded sequencing libraries.

| Oligo Name | Sequence (5′-3′) | Notes |
|---|---|---|
| CL9 | GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT | HPLC purification |
| Adapter2.1 | CGACGCTCTTCCGATC/ddT/ | HPLC purification |
| Adapter2.2 | /5Phos/AGATCGGAAGAGCGTCGTGTAGGGAAAGAG*T*G*T*A | HPLC purification |
| CL78 | /5Phos/AGATCGGAAG/ISpC3/ISpC3/ISpC3/ISpC3/ISpC3/ISpC3/ISpC3/ISpC3/ISpC3/ISpC3/3BioTEG/ | Dual HPLC purification |

For IH02, the resulting library was sequenced to 30-fold coverage (779M fragments). The fragment length distribution again exhibited a dominant peak at ~167 bp corresponding to the chromatosome, but was considerably enriched for shorter fragments relative to conventional library preparation (FIGS. 21, 22, 23A-B, 24A-B). Although all libraries exhibit ~10.4 bp periodicity, the fragment sizes are offset by 3 bp for the two methods, consistent with damaged or non-flush input molecules whose true endpoints are more faithfully represented in single-stranded libraries.

A Genome-Wide Map of In Vivo Nucleosome Protection Based on Deep cfDNA Sequencing

To assess whether the predominant local positions of nucleosomes across the human genome in tissue(s) contributing to cfDNA could be inferred by comparing the distribution of aligned fragment endpoints, or a mathematical transformation thereof, to one or more reference maps, a Windowed Protection Score (“WPS”) was developed. Specifically, it was expected that cfDNA fragment endpoints should cluster adjacent to nucleosome boundaries, while also being depleted on the nucleosome itself. To quantify this, the WPS was defined as the number of DNA fragments completely spanning a 120 bp window centered at a given genomic coordinate, minus the number of fragments with an endpoint within that same window (FIG. 25). As intended, the value of the WPS correlates with the locations of nucleosomes within strongly positioned arrays, as mapped by other groups with in vitro methods or ancient DNA (FIG. 26). At other sites, the WPS correlates with genomic features such as DNase I hypersensitive (DHS) sites (e.g., consistent with the repositioning of nucleosomes flanking a distal regulatory element) (FIG. 27).
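The WPS at a single coordinate can be sketched directly from its definition. The sketch below makes two assumptions that the text does not specify: the window is taken as the inclusive interval from center - window/2 to center + window/2, and a fragment with one or both endpoints inside the window is counted once. A genome-wide implementation would use a more efficient sweep rather than looping over all fragments per position.

```python
def windowed_protection_score(fragments, center, window=120):
    """Windowed Protection Score (WPS) at one genomic coordinate.

    fragments: iterable of inclusive (start, end) fragment coordinates.
    Returns (# fragments completely spanning the window) minus
    (# fragments with an endpoint inside the window).
    Window bounds are an assumption: [center - window//2, center + window//2].
    """
    half = window // 2
    win_start, win_end = center - half, center + half
    score = 0
    for start, end in fragments:
        has_endpoint_inside = (win_start <= start <= win_end
                               or win_start <= end <= win_end)
        if has_endpoint_inside:
            score -= 1          # fragment ends within the window
        elif start < win_start and end > win_end:
            score += 1          # fragment fully protects the window
    return score
```

Fragments lying entirely outside the window contribute nothing, so the score is positive where protection dominates and negative where endpoints cluster.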

A heuristic algorithm was applied to the genome-wide WPS of the BH01, IH01 and IH02 datasets to identify 12.6M, 11.9M, and 9.7M local maxima of nucleosome protection, respectively (FIGS. 25-31). In each sample, the mode of the distribution of distances between adjacent peaks was 185 bp with low variance (FIG. 30), generally consistent with previous analyses of the nucleosome repeat length in human or mouse cells.

To determine whether the positions of peak calls were similar across samples, the genomic distance for each peak in a sample to the nearest peak in each of the other samples was calculated. High concordance was observed (FIG. 31; FIGS. 32A-C). The median (absolute) distance from a BH01 peak call to a nearest-neighbor IH01 peak call was 23 bp overall, but was less than 10 bp for the most highly scored peaks (FIGS. 33A-B).
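A nearest-neighbor distance calculation of this kind can be sketched with a binary search over sorted peak positions. The sketch assumes peaks on a single chromosome represented by integer coordinates; the function name and the toy coordinates in the usage note are hypothetical.

```python
import bisect


def nearest_peak_distances(query_peaks, reference_peaks):
    """Absolute distance from each query peak to the nearest reference peak.

    Both inputs are peak coordinates on the same chromosome. The
    reference list is sorted once so that each lookup is a binary search.
    """
    ref = sorted(reference_peaks)
    if not ref:
        raise ValueError("reference_peaks must not be empty")
    distances = []
    for peak in query_peaks:
        i = bisect.bisect_left(ref, peak)
        candidates = []
        if i < len(ref):
            candidates.append(ref[i] - peak)      # nearest peak at or right of query
        if i > 0:
            candidates.append(peak - ref[i - 1])  # nearest peak left of query
        distances.append(min(candidates))
    return distances
```

The median of the returned list, for example via `statistics.median`, gives a summary comparable to the median distance reported above.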

Because biases introduced either by nuclease specificity or during library preparation might artifactually contribute to the signal of nucleosome protection, fragment endpoints were also simulated, matching for the depth, size distribution and terminal dinucleotide frequencies of each sample. Genome-wide WPS were then calculated, and 10.3M, 10.2M, and 8.0M local maxima were called by the same heuristic for simulated datasets matched to BH01, IH01 and IH02, respectively. Peaks from simulated datasets were associated with lower scores than peaks from real datasets (FIGS. 33A-B). Furthermore, the relatively reproducible locations of peaks called from real datasets (FIG. 31; FIGS. 32A-C) did not align well with the locations of peaks called from simulated datasets (FIG. 31; FIGS. 34A-C).

To improve the precision and completeness of the genome-wide nucleosome map, the cfDNA sequencing data from BH01, IH01, and IH02 were pooled and reanalyzed for a combined 231-fold coverage (“CH01”; 3.8 B fragments; Table 1). The WPS was calculated and 12.9M peaks were called for this combined sample. This set of peak calls was associated with higher scores and approached saturation in terms of the number of peaks (FIGS. 33A-B). Considering all peak-to-peak distances that were less than 500 bp (FIG. 35), the CH01 peak set spans 2.53 gigabases (Gb) of the human reference genome.

Nucleosomes are known to be well-positioned in relation to landmarks of gene regulation, for example transcriptional start sites and exon-intron boundaries. Consistent with that understanding, similar positioning was observed in this data as well, in relation to landmarks of transcription, translation and splicing (FIGS. 36-40). Building on past observations of correlations of nucleosome spacing with transcriptional activity and chromatin marks, the median peak-to-peak spacing within 100 kilobase (kb) windows that had been assigned to compartment A (enriched for open chromatin) or compartment B (enriched for closed chromatin) on the basis of long-range interactions (in situ Hi-C) in a lymphoblastoid cell line was examined. Nucleosomes in compartment A exhibited tighter spacing than nucleosomes in compartment B (median 187 bp (A) vs. 190 bp (B)), with further differences between certain subcompartments (FIG. 41). Along the length of chromosomes, no general pattern was seen, except that median nucleosome spacing dropped sharply in pericentromeric regions, driven by strong positioning across arrays of alpha satellites (171 bp monomer length; FIG. 42; FIG. 26).

Short cfDNA Fragments Directly Footprint CTCF and Other Transcription Factors

Previous studies of DNase I cleavage patterns identified two dominant classes of fragments: longer fragments associated with cleavage between nucleosomes, and shorter fragments associated with cleavage adjacent to transcription factor binding sites (TFBS). To assess whether in vivo-derived cfDNA fragments also resulted from two classes of sensitivity to nuclease cleavage, sequence reads (CH01) were partitioned on the basis of inferred fragment length, and the WPS was recalculated using long fragments (120-180 bp; 120 bp window; effectively the same as the WPS described above for nucleosome calling) or short fragments (35-80 bp; 16 bp window) separately (FIGS. 26-27). To obtain a set of well-defined TFBSs enriched for actively bound sites in our data, clustered FIMO predictions were intersected with a unified set of ChIP-seq peaks from ENCODE (TfbsClusteredV3) for each TF.

The long fraction WPS supports strong organization of nucleosomes in the vicinity of CTCF binding sites (FIG. 43). However, a strong signal in the short fraction WPS is also observed that is coincident with the CTCF binding site itself (FIGS. 44-45). CTCF binding sites were stratified based on a presumption that they are bound in vivo (all FIMO predictions vs. the subset intersecting with ENCODE ChIP-seq vs. the further subset intersecting with those that appear to be utilized across 19 cell lines). Experimentally well-supported CTCF sites exhibit a substantially broader spacing between the flanking −1 and +1 nucleosomes based on the long fraction WPS, consistent with their repositioning upon CTCF binding (~190 bp → ~260 bp; FIGS. 45-48). Furthermore, experimentally well-supported CTCF sites exhibit a much stronger signal for the short fraction WPS over the CTCF binding site itself (FIGS. 49-52).

Similar analyses were performed for additional TFs for which both FIMO predictions and ENCODE ChIP-seq data were available (FIGS. 53A-H). For many of these TFs, such as ETS and MAFK (FIGS. 54-55), a short fraction footprint was observed, accompanied by periodic signal in the long fraction WPS. This is consistent with strong positioning of nucleosomes surrounding bound TFBS. Overall, these data support the view that short cfDNA fragments, which are recovered markedly better by the single-stranded protocol (FIG. 18, FIG. 21), directly footprint the in vivo occupancy of DNA-bound transcription factors, including CTCF and others.

Nucleosome Spacing Patterns Inform cfDNA Tissues-of-Origin

To determine whether in vivo nucleosome protection, as measured through cfDNA sequencing, could be used to infer the cell types contributing to cfDNA in healthy individuals, the peak-to-peak spacing of nucleosome calls within DHS sites defined in 116 diverse biological samples was examined. Widened spacing was previously observed between the −1 and +1 nucleosomes at regulatory elements (e.g., anecdotally at DHS sites (FIG. 27) or globally at bound CTCF sites (FIG. 45)). Similar to bound CTCF sites, substantially broader spacing was observed for nucleosome pairs within a subset of DHS sites, plausibly corresponding to sites at which the nucleosomes are repositioned by intervening transcription factor binding in the cell type(s) giving rise to cfDNA (~190 bp → ~260 bp; FIG. 56). Indeed, the proportion of widened nucleosome spacing (~260 bp) varies considerably depending on which cell type's DHS sites are used. However, all of the cell types for which this proportion is highest are lymphoid or myeloid in origin (e.g., CD3_CB-DS17706, etc. in FIG. 56). This is consistent with hematopoietic cell death as the dominant source of cfDNA in healthy individuals.

Next, the signal of nucleosome protection in the vicinity of transcriptional start sites was re-examined (FIG. 36). When the signal was stratified based on gene expression in a lymphoid lineage cell line, NB-4, strong differences in the locations or intensity of nucleosome protection in relation to the TSS were observed between highly and lowly expressed genes (FIG. 57). Furthermore, the short fraction WPS exhibits a clear footprint immediately upstream of the TSS whose intensity also strongly correlates with expression level (FIG. 58). This plausibly reflects footprinting of the transcription preinitiation complex, or some component thereof, at transcriptionally active genes.

These data demonstrate that cfDNA fragmentation patterns do indeed contain signal that might be used to infer the tissue(s) or cell-type(s) giving rise to cfDNA.

However, a challenge is that relatively few reads in a genome-wide cfDNA library directly overlap DHS sites and transcriptional start sites.

Nucleosome spacing varies between cell types, and as a function of chromatin state and gene expression. In general, open chromatin and transcription are associated with a shorter nucleosome repeat length, consistent with this Example's analyses of compartment A vs. B (FIG. 41). This Example's peak call data also exhibits a correlation between nucleosome spacing across gene bodies and their expression levels, with tighter spacing associated with higher expression (FIG. 59; ρ=−0.17; n=19,677 genes). The correlation is highest for the gene body itself, relative to adjacent regions (upstream 10 kb ρ=−0.08; downstream 10 kb ρ=−0.01). If the analysis is limited to gene bodies that span at least 60 nucleosome calls, tighter nucleosome spacing is even more strongly correlated with gene expression (ρ=−0.50; n=12,344 genes).
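A rank correlation of the kind denoted by ρ can be sketched as follows, assuming that ρ is Spearman's rank correlation coefficient (the text does not name the statistic). The sketch computes average ranks for ties and then a Pearson correlation on the ranks; function names and the example values in the usage note are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

With per-gene median spacing values such as [185, 187, 190, 193, 196] and expression values such as [50, 20, 30, 5, 1], the coefficient is negative, reflecting tighter spacing at higher expression.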

One advantage of exploiting signals such as nucleosome spacing across gene bodies or other domains is that a much larger proportion of cfDNA fragments will be informative. Another potential advantage is that mixtures of signals resulting from multiple cell types contributing to cfDNA might be detectable. To test this, a further mathematical transformation, fast Fourier transformation (FFT), was performed on the long fragment WPS across the first 10 kb of gene bodies and on a gene-by-gene basis. The intensity of the FFT signal correlated with gene expression at specific frequency ranges, with a maximum at 177-180 bp for positive correlation and a minimum at ~199 bp for negative correlation (FIG. 60). In performing this analysis against a set of 76 expression datasets for human cell lines and primary tissues, the strongest correlations were with hematopoietic lineages (FIG. 60). For example, the most highly ranked negative correlations with average intensity in the 193-199 bp frequency range for each of three healthy samples (BH01, IH01, IH02) were all to lymphoid cell lines, myeloid cell lines, or bone marrow tissue (FIG. 61; Table 3):

TABLE 3 Correlation of WPS FFT intensities with gene expressiondatasets. Correlations Rank Differences RName Category Type DescriptionBH01 IH01 IH02 IC15 IC20 IC17 IC37 IC35 healthy IC15 IC20 IC17 IC37 IC35A.431 Skin Skin Epidermoid −0.298 −0.188 −0.149 −0.200 −0.140 −0.176−0.195 −0.178 2 3 −9 −9 −12 −21 cancer carcinoma (Squamous cell linecells) A549 Lung Lung Lung −0.269 −0.185 −0.144 −0.202 −0.139 −0.172−0.188 −0.170 3 −14 −12 −9 −2 −13 carcinoma carcinoma cell lineadipose_(—) Primary Adipose Primary −0.270 −0.169 −0.137 −0.169 −0.121−0.153 −0.166 −0.146 1 12 5 0 14 12 tissue Tissue tissue tissueadrenal_(—) Primary Adrenal Primary −0.257 −0.158 −0.131 −0.173 −0.118−0.145 −0.161 −0.138 −2 −11 −5 1 5 6 gland Tissue gland tissue AN3.CABreast/Female Uterine Metastatic −0.303 −0.194 −0.157 −0.213 −0.147−0.183 −0.195 −0.171 −4 −16 −13 −15 −6 −2 Reproductive cancerendometrial adenocarcinoma cell line appendix Primary Appendix Primary−0.287 −0.185 −0.137 −0.168 −0.118 −0.148 −0.171 −0.152 6 24 20 23 6 9Tissue tissue BEWO Other Uterine Metastatic −0.254 −0.184 −0.147 −0.193−0.139 −0.173 −0.194 −0.173 −5 3 −12 −15 −19 −27 cancer choriocarcinomacell line bone_(—) Primary Bone Primary −0.343 −0.230 −0.165 −0.192−0.142 −0.167 −0.193 −0.165 2 40 9 30 18 28 marrow Tissue marrow tissueCACO_2 Abdominal Colon Colon −0.281 −0.177 −0.137 −0.192 −0.128 −0.169−0.184 −0.164 5 −5 −5 −14 −10 −9 adenocarcinoma adenocarcinoma cell lineCAPAH_2 Abdominal Pancreas Pancreas −0.291 −0.187 −0.145 −0.202 −0.136−0.176 −0.195 −0.175 3 −12 −2 −10 −19 −25 adenocarcinoma adenocarcinomacell line cerebral_(—) Primary Cerebral Primary −0.225 −0.136 −0.120−0.168 −0.103 −0.134 −0.142 −0.125 −1 −9 −3 0 0 0 cortex Tissue cortextissue colon Primary Colon Primary −0.261 −0.162 −0.124 −0.154 −0.111−0.145 −0.168 −0.148 7 6 6 6 −7 1 Tissue tissue Daudi Lymphoid HumanHuman −0.321 −0.206 −0.153 −0.195 −0.133 −0.165 −0.169 −0.160 4 17 19 1913 24 Burkitt Burkitt lymphoma lymphoma cell line duodenum PrimaryDuodenum 
Primary −0.281 −0.164 −0.122 −0.159 −0.109 −0.144 −0.166 −0.14410 10 10 7 −4 7 Tissue tissue EFO.21 Breast/Female Ovarian Metastatic−0.287 −0.186 −0.149 −0.201 −0.140 −0.176 −0.168 −0.169 −7 −9 −14 −20 −1−8 Reproductive cancer ovarian corpus cystadenocarcinoma cell lineendometrium Primary Endometrium Primary −0.257 −0.158 −0.132 −0.170−0.119 −0.151 −0.156 −0.151 −3 −11 −4 −9 −3 −12 Tissue tissue esophagusPrimary Esophagus Primary −0.237 −0.147 −0.124 −0.156 −0.116 −0.141−0.158 −0.145 −3 1 −7 0 0 −7 Tissue tissue fallopian_(—) PrimaryFallopian Primary −0.247 −0.157 −0.129 −0.171 −0.114 −0.145 −0.161−0.145 −4 −13 −2 −3 3 −2 tube Tissue tube tissue gallbladder PrimaryGallbladder Primary −0.249 −0.156 −0.119 −0.153 −0.103 −0.138 −0.154−0.141 4 4 4 3 4 1 Tissue tissue HaCaT Skin Keratinocyte Keratinocyte−0.290 −0.186 −0.149 −0.194 −0.142 −0.173 −0.193 −0.173 −5 7 −18 −4 −6−17 cell line cell line HDLM.2 Lymphoid Hodgkin Hodgkin −0.316 −0.200−0.154 −0.201 −0.138 −0.173 −0.195 −0.171 1 6 11 1 −5 −5 lymphomalymphoma cell line heart_(—) Primary Heart Primary −0.248 −0.149 −0.126−0.168 −0.113 −0.141 −0.155 −0.140 −3 −3 −3 0 3 2 muscle Tissue muscletissue HEX_293 Other Kidney Embryonal −0.292 −0.187 −0.150 −0.209 −0.139−0.172 −0.159 −0.168 −4 −17 −4 −2 3 0 adrenal kidney cell precursorline, cell line transformed by adenovirus type 5 HEL MyeloidErythroleukemia Erythroleukemia −0.324 −0.205 −0.161 −0.210 −0.140−0.172 −0.194 −0.168 −1 −5 4 12 5 14 cell line (AML M8 in relapse aftertreatment for Hodgkin's disease) HeLa Breast/Female Cervical Cervical−0.286 −0.186 −0.149 −0.203 −0.139 −0.172 −0.190 −0.171 1 −10 −5 −3 1 −6Reproductive cancer epithelial adenocarcinoma cell line Hep_G2 AbdominalHepatocellular Hepatocellular −0.294 −0.186 −0.152 −0.202 −0.145 −0.186−0.196 −0.167 −4 −6 −18 −24 −17 2 carcinoma carcinoma cell line HL.60Myeloid Promyelocytic Acute −0.332 −0.208 −0.161 −0.202 −0.137 −0.171−0.197 −0.170 2 8 18 18 −1 11 leukemia promyelocytic leukemia (APL) cellline HMC.1 
Myeloid Mastcell Mastcell −0.337 −0.228 −0.165 −0.212 −0.149−0.181 −0.199 −0.180 0 −1 −2 3 0 −2 leukemia leukemia cell line K.662Lymphoid Leukemia Chronic −0.317 −0.202 −0.158 −0.211 −0.143 −0.176−0.195 −0.166 −3 −9 −5 −5 −1 13 myeloid leukemia (CML) cell lineKarpas.707 Lymphoid Multiple Multiple −0.325 −0.210 −0.155 −0.195 −0.138−0.167 −0.188 −0.154 4 20 18 22 21 22 myeloma myeloma cell line kidneyPrimary Kidney Primary −0.245 −0.150 −0.130 −0.163 −0.119 −0.153 −0.171−0.147 −7 −4 −12 −19 −21 −6 Tissue tissue liver Primary Liver Primary−0.248 −0.148 −0.122 −0.150 −0.110 −0.150 −0.184 −0.138 1 4 −1 −13 −4 3Tissue tissue lung Primary Lung Primary −0.254 −0.170 −0.133 −0.170−0.121 −0.148 −0.167 −0.149 3 4 0 7 3 6 Tissue tissue lymph_node PrimaryLymph Primary −0.308 −0.195 −0.148 −0.162 −0.128 −0.155 −0.161 −0.156 724 17 25 14 22 Tissue node tissue MCF7 Breast/Female Breast Metastatic−0.298 −0.195 −0.154 −0.207 −0.145 −0.183 −0.196 −0.181 −3 −9 −12 −18−11 −19 Reproductive cancer breast adenocarcinoma cell line MOLT.4Lymphoid Leukemia Acute −0.323 −0.204 −0.163 −0.212 −0.144 −0.177 −0.197−0.173 −3 −7 −2 −1 −5 −1 (ALL) lymphoblastic leukemia (T-ALL) cell lineNB.4 Myeloid Promyelocytic Acute −0.348 −0.228 −0.172 −0.211 −0.148−0.182 −0.202 −0.171 0 4 3 5 2 13 leukemia promyelocytic leukemia (APL)cell line NTERA_2 Urinary/Male Urinary Metastatic −0.269 −0.170 −0.137−0.193 −0.117 −0.157 −0.169 −0.153 −2 −8 18 −2 5 0 Reproductive cancerembryonal carcinoma cell line, cloned from TERA-2 ovary Primary OvaryPrimary −0.268 −0.162 −0.135 −0.181 −0.120 −0.152 −0.166 −0.151 1 −7 2−2 6 −1 Tissue tissue pancreas Primary Pancreas Primary −0.250 −0.159−0.132 −0.170 −0.116 −0.150 −0.166 −0.150 −5 −5 1 −8 −5 −7 Tissue tissuePC.3 Urinary/Male Prostate Metastatic −0.295 −0.190 −0.151 −0.204 −0.138−0.174 −0.188 −0.173 −3 −10 2 −6 8 −12 Reproductive cancer poorlydifferentiated prostate adenocarcinoma cell line placenta PrimaryPlacenta Primary −0.266 −0.168 −0.134 −0.168 −0.126 −0.151 −0.168 
−0.1503 10 −7 1 9 6 Tissue tissue prostate Primary Prostate Primary −0.248−0.161 −0.133 −0.175 −0.123 −0.150 −0.165 −0.151 −8 −10 −11 −8 1 −12Tissue tissue rectum Primary Rectum Primary −0.255 −0.154 −0.117 −0.159−0.102 −0.136 −0.161 −0.142 6 0 5 4 −2 0 Tissue tissue REH LymphoidLeukemia Pre-B cell −0.330 −0.216 −0.165 −0.214 −0.150 −0.162 −0.204−0.174 −2 −5 −5 −2 −4 1 (ALL) leukemia cell line (ALL, first relapse)RH.30 Sarcoma Rhabdomyosarcoma Metastatic −0.260 −0.165 −0.137 −0.194−0.125 −0.158 −0.175 −0.158 2 −14 −3 −7 −7 −7 rhabdomyosarcoma cell lineRPML3226 Lymphoid Multiple Multiple −0.322 −0.207 −0.155 −0.198 −0.138−0.169 −0.190 −0.164 1 18 11 19 14 22 Myeloma myeloma cell line RT4Urinary/Male Bladder Urinary −0.282 −0.168 −0.145 −0.192 −0.138 −0.170−0.191 −0.171 −5 −1 −12 −18 −19 −25 Reproductive cancer bladdertransitional cell carcinoma cell line salivary_(—) Primary SalivaryPrimary −0.262 −0.188 −0.138 −0.177 −0.126 −0.154 −0.172 −0.155 −7 2 −9−2 −5 −5 gland Tissue gland tissue SCLC.21H Lung Small cell Small cell−0.259 −0.160 −0.138 −0.201 −0.123 −0.157 −0.172 −0.148 −11 −31 −5 −12−10 8 lung lung carcinoma carcinoma cell line SH.SY5Y BrainNeuroblastoma Metastatic −0.271 −0.170 −0.137 −0.201 −0.124 −0.157−0.170 −0.151 2 −25 2 −5 1 6 neuroblastoma, cloned subline ofneuroepithelioma cell line SK-N-SM SiHa Breast/Female Cervical Cervical−0.288 −0.180 −0.148 −0.201 −0.139 −0.176 −0.193 −0.175 −2 −7 −15 −19−11 −27 Reproductive cancer squamous cell carcinoma cell line,integrated 1- 2 copies of HPV18 SK.BR_3 Breast/Female Breast Metastatic−0.288 −0.176 −0.148 −0.195 −0.140 −0.176 −0.191 −0.169 −3 −4 −21 −22−12 −11 Reproductive cancer breast adenocarcinoma cell line SK.MEL_30Primary Melanoma Metastatic −0.301 −0.187 −0.154 −0.208 −0.141 −0.174−0.193 −0.171 −2 −12 −8 −3 −1 −6 Tissue malignant melanoma cell lineskeletal_(—) Primary Skeletal Primary −0.261 −0.166 −0.134 −0.179 −0.125−0.150 −0.164 −0.145 −1 −7 −7 0 9 11 muscle Tissue muscle tissue skinSkin Skin Primary 
−0.259 −0.166 −0.134 −0.168 −0.127 −0.148 −0.167−0.151 −4 8 −14 5 −1 −4 tissue small_(—) Primary Small Primary −0.260−0.164 −0.121 −0.158 −0.107 −0.141 −0.166 −0.142 9 10 11 9 0 7 intestineTissue intestin tissue smooth_(—) Primary Smooth Primary −0.259 −0.158−0.127 −0.169 −0.113 −0.144 −0.161 −0.149 2 −6 3 4 4 −5 muscle Tissuemuscle tissue spleen Primary Spleen Primary −0.308 −0.202 −0.148 −0.160−0.130 −0.155 −0.177 −0.154 7 27 15 25 20 25 Tissue tissue stomachPrimary Stomach Primary −0.264 −0.170 −0.131 −0.170 −0.117 −0.149 −0.169−0.151 6 3 9 6 0 2 Tissue tissue testis Primary Testis Primary −0.215−0.142 −0.109 −0.147 −0.093 −0.128 −0.133 −0.123 0 0 0 0 0 0 Tissuetissue TMP.1 Myeloid Monocytic Acute −0.338 −0.218 −0.166 −0.206 −0.149−0.162 −0.204 −0.176 −1 6 −1 1 −3 0 leukemia monocytic leukemia (AML)cell line thyroid_(—) Primary Thyroid Primary −0.261 −0.158 −0.138−0.178 −0.121 −0.153 −0.170 −0.161 −2 −7 −2 −6 −6 −19 gland Tissue glandtissue TiME Other Microvascular Telomerase −0.296 −0.180 −0.147 −0.198−0.134 −0.170 −0.186 −0.170 5 −3 3 −1 3 −11 endothellial immortalizedcell line human microvascular endothelial cells (pooled) tonsil PrimaryTonsil Primary −0.282 −0.179 −0.141 −0.169 −0.125 −0.147 −0.173 −0.152−1 20 8 23 4 9 Tissue tissue U.139_MG Brain Glioblastoma Glioblastoma−0.288 −0.177 −0.144 −0.191 −0.126 −0.162 −0.177 −0.161 1 8 7 0 2 2 cellline U.2_O5 Sarcoma Osteosarcoma Osteosarcoma −0.275 −0.175 −0.139−0.192 −0.134 −0.159 −0.170 −0.160 −2 0 −11 −3 6 −3 cell line U.2197Sarcoma Sarcoma Malignant −0.290 −0.181 −0.146 −0.195 −0.129 −0.184−0.180 −0.165 2 1 5 3 4 0 fibrous histiocytoma cell line U.251_MG BrainGlioblastoma Glioblastoma −0.292 −0.178 −0.140 −0.197 −0.125 −0.180−0.177 −0.165 9 −6 11 4 4 −4 cell line U.266.70 Lymphoid MultipleMultiple −0.320 −0.207 −0.157 −0.202 −0.135 −0.170 −0.191 −0.165 −1 4 1915 12 17 Myeloma myeloma cell line (1970, IL-6- dependent) U.266.64Lymphoid Multiple Multiple −0.320 −0.212 −0.182 −0.207 −0.139 −0.175−0.194 −0.169 −1 
2 11 8 10 14 Myeloma myeloma cell line (1984, in vitrodifferentiated) U.698 Lymphoid B-cell B-cell −0.328 −0.212 −0.159 −0.203−0.137 −0.170 −0.194 −0.165 2 5 18 20 6 20 lymphoma lymphoma cell line(lymphoblastic lymphosarcoma) U.87_MG Brain Glioblastoma, Glioblastoma,−0.285 −0.175 −0.143 −0.192 −0.127 −0.160 −0.174 −0.162 1 0 2 −2 2 −4astrocytoma astrocytoma cell line U.937 Myeloid MyelomonocyticMyelomonocytic −0.346 −0.224 −0.167 −0.201 −0.146 −0.150 −0.199 −0.173 118 3 5 2 6 histiocytic histiocytic lymphoma lymphoma cell lineurinary_(—) Primary Urinary Primary −0.260 −0.158 −0.130 −0.165 −0.113−0.148 −0.164 −0.150 3 5 −2 1 3 −6 bladder Tissue bladder tissue WM.116Skin Melanoma Malignant −0.284 −0.175 −0.144 −0.193 −0.130 −0.160 −0.178−0.157 −1 −4 −4 −3 −3 2 melanoma cell line

Correlation values between average FFT (fast Fourier transformation) intensities for the 193-199 bp frequencies in the first 10 kb downstream of the transcriptional start site with FPKM expression values measured for 19,378 Ensembl gene identifiers in 44 human cell lines and 32 primary tissues by the Human Protein Atlas. Table 3 also contains brief descriptions for each of the expression samples as provided by the Protein Atlas as well as rank transformations and rank differences to the IH01, IH02 and BH01 samples.

Example 5 Determining Non-Healthy Tissue(s)-of-Origin from cfDNA

To test whether additional contributing tissues in non-healthy states might be inferred, cfDNA samples obtained from five late-stage cancer patients were sequenced. The patterns of nucleosome spacing in these samples revealed additional contributions to cfDNA that correlated most strongly with non-hematopoietic tissues or cell lines, often matching the anatomical origin of the patient's cancer.

Nucleosome Spacing in Cancer Patients' cfDNA Identifies Non-Hematopoietic Contributions

To determine whether signatures of non-hematopoietic lineages contributing to circulating cfDNA in non-healthy states could be detected, 44 plasma samples from individuals with clinical diagnoses of a variety of Stage IV cancers were screened with light sequencing of single-stranded libraries prepared from cfDNA (Table 4; median 2.2-fold coverage):

TABLE 4 Clinical diagnoses and cfDNA yield for cancer panel.

Patient Sample ID | Clinical Dx | Stage | cfDNA Yield (ng/ml) | Sex
IC01 † | Kidney cancer (Transitional cell) | IV | 242 | F
IC02 | Ovarian cancer (undefined) | IV | 22.5 | F
IC03 | Skin cancer (Melanoma) | IV | 12.0 | M
IC04 | Breast cancer (Invasive/infiltrating ductal) | IV | 12.6 | F
IC05 | Lung cancer (Adenocarcinoma) | IV | 5.4 | M
IC06 | Lung cancer (Mesothelioma) | IV | 11.4 | M
IC07 † | Gastric cancer (undefined) | IV | 52.2 | M
IC08 | Uterine cancer (undefined) | IV | 15.0 | F
IC09 | Ovarian cancer (serous tumors) | IV | 8.4 | F
IC10 | Lung cancer (adenocarcinoma) | IV | 11.4 | F
IC11 | Colorectal cancer (undefined) | IV | 11.4 | M
IC12 | Breast cancer (Invasive/infiltrating lobular) | IV | 12.0 | F
IC13 | Prostate cancer (undefined) | IV | 12.3 | M
IC14 | Head and neck cancer (undefined) | IV | 27.0 | M
IC15 § | Lung cancer (Small cell) | IV | 22.5 | M
IC16 | Bladder cancer (undefined) | IV | 14.1 | M
IC17 § | Liver cancer (Hepatocellular carcinoma) | IV | 39.0 | M
IC18 | Kidney cancer (Clear cell) | IV | 10.5 | F
IC19 | Testicular cancer (Seminomatous) | IV | 9.6 | M
IC20 § | Lung cancer (Squamous cell carcinoma) | IV | 21.9 | M
IC21 | Pancreatic cancer (Ductal adenocarcinoma) | IV | 35.4 | M
IC22 | Lung cancer (Adenocarcinoma) | IV | 11.4 | F
IC23 | Liver cancer (Hepatocellular carcinoma) | IV | 17.1 | M
IC24 | Pancreatic cancer (Ductal adenocarcinoma) | IV | 37.2 | M
IC25 | Pancreatic cancer (Ductal adenocarcinoma) | IV | 27.9 | M
IC26 | Prostate cancer (Adenocarcinoma) | IV | 24.6 | M
IC27 | Uterine cancer (undefined) | IV | 19.2 | F
IC28 | Lung cancer (Squamous cell carcinoma) | IV | 33.3 | M
IC29 | Head and neck cancer (undefined) | IV | 14.4 | M
IC30 | Esophageal cancer (undefined) | IV | 10.5 | M
IC31 † | Ovarian cancer (undefined) | IV | 334.8 | F
IC32 | Lung cancer (Small cell) | IV | 9.6 | F
IC33 | Colorectal cancer (Adenocarcinoma) | IV | 13.8 | M
IC34 | Breast cancer (Invasive/infiltrating lobular) | IV | 33.6 | F
IC35 § | Breast cancer (Ductal carcinoma in situ) | IV | 16.2 | F
IC36 | Liver cancer (undefined) | IV | 26.4 | M
IC37 § | Colorectal cancer (Adenocarcinoma) | IV | 15.9 | F
IC38 | Bladder cancer (undefined) | IV | 6.6 | M
IC39 | Kidney cancer (undefined) | IV | 39.0 | M
IC40 | Prostate cancer (Adenocarcinoma) | IV | 13.8 | M
IC41 | Testicular cancer (Seminomatous) | IV | 16.5 | M
IC42 | Lung cancer (Adenocarcinoma) | IV | 11.4 | F
IC43 | Skin cancer (Melanoma) | IV | 21.9 | F
IC44 | Esophageal cancer (undefined) | IV | 25.8 | F
IC45 † | Colorectal cancer (Adenocarcinoma) | IV | 3.0 | M
IC46 ** | Breast cancer (Ductal carcinoma in situ) | IV | 36.6 | F
IC47 | Pancreatic cancer (Ductal adenocarcinoma) | IV | 19.2 | F
IC48 ** | Breast cancer (Invasive/infiltrating lobular) | IV | 13.8 | F

§: sample was selected for additional sequencing. **: only 0.5 ml of plasma was available for this sample. †: sample failed QC and was not used for further analysis.

Table 4 shows clinical and histological diagnoses for 48 patients from whom plasma-borne cfDNA was screened for evidence of high tumor burden, along with total cfDNA yield from 1.0 ml of plasma from each individual and relevant clinical covariates. Of these 48, 44 passed QC and had sufficient material. Of these 44, five were selected for deeper sequencing. cfDNA yield was determined by Qubit Fluorometer 2.0 (Life Technologies).

These samples were prepared with the same protocol, and many in the same batch, as IH02 of Example 4. Human peripheral blood plasma for 52 individuals with clinical diagnosis of Stage IV cancer (Table 4) was obtained from Conversant Bio or PlasmaLab International (Everett, Wash., USA) and stored in 0.5 ml or 1 ml aliquots at −80° C. until use. Human peripheral blood plasma for four individuals with clinical diagnosis of systemic lupus erythematosus was obtained from Conversant Bio and stored in 0.5 ml aliquots at −80° C. until use. Frozen plasma aliquots were thawed on the bench-top immediately before use. Circulating cell-free DNA was purified from 2 ml of each plasma sample with the QiaAMP Circulating Nucleic Acids kit (Qiagen) as per the manufacturer's protocol. DNA was quantified with a Qubit fluorometer (Invitrogen). To verify cfDNA yield in a subset of samples, purified DNA was further quantified with a custom qPCR assay targeting a multicopy human Alu sequence; the two estimates were found to be concordant.

Because matched tumor genotypes were not available, each sample was scored on two metrics of aneuploidy to identify a subset likely to contain a high proportion of tumor-derived cfDNA: first, the deviation from the expected proportion of reads derived from each chromosome (FIG. 62A); and second, the per-chromosome allele balance profile for a panel of common single nucleotide polymorphisms (FIG. 62B). Based on these metrics, single-stranded libraries derived from five individuals (with a small cell lung cancer, a squamous cell lung cancer, a colorectal adenocarcinoma, a hepatocellular carcinoma, and a ductal carcinoma in situ breast cancer) were sequenced to a depth similar to that of IH02 in Example 4 (Table 5; mean 30-fold coverage):

TABLE 5 Sequencing statistics for additional samples included in CA01 set.

Sample name | Library type | Reads | Fragments sequenced | Aligned | Aligned Q30 | Coverage | Est. % duplicates | 35-80 bp | 120-180 bp
IH03 | SSP | 2x39 | 53292855 | 92.66% | 72.37% | 2.29 | 15.46% | 11.05% | 52.34%
IP01 † | DSP | 2x101, 2x102 | 1214536629 | 97.22% | 86.38% | 76.11 | 0.55% | 0.08% | 62.77%
IP02 † | DSP | 2x101, 2x102 | 655040273 | 97.16% | 87.72% | 52.45 | 0.83% | 0.07% | 68.10%
IA01 | SSP | 2x39 | 53934607 | 87.42% | 68.30% | 2.02 | 22.70% | 15.20% | 49.77%
IA02 | SSP | 2x39 | 42496222 | 95.42% | 76.61% | 1.95 | 4.74% | 12.28% | 59.00%
IA03 | SSP | 2x39 | 51278489 | 93.12% | 71.33% | 2.05 | 25.68% | 14.27% | 52.57%
IA04 | SSP | 2x39 | 50768476 | 90.30% | 70.51% | 2.14 | 7.83% | 17.80% | 36.76%
IA05 | DSP | 2x101 | 194985271 | 98.80% | 90.61% | 11.09 | 12.05% | 2.24% | 71.67%
IA06 | DSP | 2x101 | 171670054 | 98.90% | 90.88% | 9.90 | 5.41% | 1.93% | 71.26%
IA07 | DSP | 2x101 | 208609489 | 98.67% | 90.34% | 11.69 | 11.45% | 2.59% | 74.84%
IA08 | DSP | 2x101 | 193729556 | 98.61% | 90.70% | 10.84 | 11.96% | 2.58% | 76.24%
IC02 | SSP | 2x39 | 57913605 | 95.07% | 75.57% | 2.59 | 5.40% | 12.98% | 60.00%
IC03 | SSP | 2x39 | 53862631 | 96.78% | 75.66% | 2.79 | 8.32% | 13.25% | 62.20%
IC04 | SSP | 2x39 | 55239248 | 95.47% | 76.26% | 2.57 | 8.28% | 10.98% | 58.48%
IC05 | SSP | 2x39 | 39623850 | 89.80% | 69.92% | 1.60 | 9.24% | 14.63% | 50.33%
IC06 | SSP | 2x39 | 59679981 | 95.57% | 74.90% | 2.11 | 3.93% | 24.30% | 41.46%
IC08 | SSP | 2x39 | 46933688 | 94.38% | 74.21% | 1.92 | 5.92% | 16.04% | 45.25%
IC09 | SSP | 2x42 | 59639583 | 91.22% | 71.15% | 2.13 | 6.69% | 21.39% | 43.50%
IC10 | SSP | 2x42 | 53994406 | 93.73% | 73.40% | 1.83 | 2.00% | 27.08% | 37.62%
IC11 | SSP | 2x42 | 59225460 | 93.26% | 72.51% | 2.15 | 5.26% | 21.30% | 43.33%
IC12 | SSP | 2x42 | 57884742 | 93.52% | 74.33% | 2.34 | 2.66% | 18.28% | 46.58%
IC13 | SSP | 2x42 | 71946779 | 92.94% | 72.47% | 2.52 | 2.18% | 23.51% | 43.97%
IC14 | SSP | 2x42 | 61649203 | 94.54% | 73.47% | 2.20 | 3.23% | 22.26% | 43.37%
IC15 | SSP | 2x50, 43/42 | 908512803 | 95.49% | 76.83% | 29.77 | 10.66% | 25.42% | 38.47%
IC16 | SSP | 2x42 | 62739733 | 92.81% | 72.85% | 2.47 | 2.77% | 17.71% | 48.04%
IC17 | SSP | 2x50, 2x39 | 1072374044 | 96.02% | 76.42% | 42.08 | 12.16% | 17.08% | 50.02%
IC18 | SSP | 2x39 | 59976914 | 87.91% | 68.67% | 2.24 | 4.39% | 18.85% | 44.44%
IC19 | SSP | 2x39 | 51447149 | 89.38% | 69.39% | 2.02 | 8.24% | 17.30% | 46.33%
IC20 | SSP | 2x50, 2x39 | 640838540 | 96.30% | 79.11% | 23.36 | 12.43% | 25.72% | 33.87%
IC21 | SSP | 2x39 | 53000679 | 94.64% | 74.57% | 1.79 | 37.39% | 29.89% | 43.81%
IC22 | SSP | 2x39 | 58102606 | 94.08% | 74.08% | 2.51 | 6.24% | 13.65% | 58.41%
IC23 | SSP | 2x39 | 65859970 | 95.67% | 75.67% | 2.94 | 5.34% | 11.09% | 60.86%
IC24 | SSP | 43/42 | 66344431 | 94.63% | 74.46% | 2.46 | 2.00% | 22.46% | 45.31%
IC25 | SSP | 43/42 | 75066833 | 93.75% | 73.66% | 2.86 | 2.24% | 21.30% | 46.19%
IC26 | SSP | 43/42 | 79180860 | 92.59% | 72.32% | 2.97 | 2.93% | 22.34% | 40.42%
IC27 | SSP | 43/42 | 78037377 | 88.81% | 67.04% | 2.20 | 1.50% | 31.31% | 30.59%
IC28 | SSP | 43/42 | 61402081 | 95.24% | 75.74% | 2.60 | 2.46% | 15.71% | 46.44%
IC29 | SSP | 2x39 | 49989522 | 94.45% | 73.36% | 1.75 | 3.03% | 25.82% | 36.23%
IC30 | SSP | 2x39 | 58439504 | 93.52% | 71.19% | 1.75 | 17.35% | 29.58% | 30.47%
IC32 | SSP | 43/42 | 78233981 | 87.86% | 66.80% | 2.25 | 1.79% | 30.12% | 31.20%
IC33 | SSP | 43/42 | 62196185 | 87.26% | 66.71% | 1.93 | 1.93% | 27.44% | 36.92%
IC34 | SSP | 43/42 | 63572169 | 95.42% | 76.74% | 2.53 | 2.35% | 19.64% | 48.55%
IC35 | SSP | 43/42 | 618664393 | 86.47% | 65.90% | 18.22 | 5.23% | 28.18% | 35.24%
IC36 | SSP | 43/42 | 54402943 | 94.62% | 74.73% | 2.21 | 3.32% | 17.02% | 52.42%
IC37 | SSP | 2x50, 43/42 | 1175553677 | 93.00% | 74.46% | 38.22 | 10.15% | 28.47% | 35.11%
IC38 | SSP | 43/42 | 47981963 | 89.35% | 69.45% | 1.78 | 6.47% | 18.59% | 43.03%
IC39 | SSP | 43/42 | 61968854 | 95.29% | 75.57% | 2.62 | 2.54% | 14.42% | 57.28%
IC40 | SSP | 2x39 | 53228209 | 93.54% | 71.69% | 1.81 | 8.65% | 24.88% | 34.95%
IC41 | SSP | 43/42 | 78081655 | 87.11% | 65.25% | 2.26 | 1.61% | 27.94% | 35.21%
IC42 | SSP | 2x39 | 53017317 | 93.59% | 74.33% | 2.02 | 10.74% | 19.04% | 44.12%
IC43 | SSP | 43/42 | 76395478 | 88.41% | 67.21% | 2.40 | 1.56% | 26.68% | 37.76%
IC44 | SSP | 43/42 | 61354307 | 95.15% | 74.88% | 2.45 | 4.34% | 19.10% | 46.39%
IC46 | SSP | 2x39 | 60123123 | 94.51% | 72.23% | 2.13 | 10.37% | 15.46% | 50.93%
IC47 | SSP | 2x39 | 59438172 | 95.58% | 73.84% | 2.07 | 9.33% | 21.67% | 43.34%
IC48 | SSP | 43/42 | 55704417 | 91.35% | 72.79% | 2.01 | 13.87% | 22.56% | 38.68%
IC49 | DSP | 2x101 | 170489015 | 99.02% | 90.53% | 11.19 | 5.93% | 2.41% | 59.93%
IC50 | DSP | 2x101 | 203828224 | 98.72% | 90.28% | 10.82 | 2.83% | 4.81% | 66.23%
IC51 | DSP | 2x101 | 200454421 | 98.63% | 90.53% | 11.77 | 9.50% | 2.58% | 67.04%
IC52 | DSP | 2x101 | 186975845 | 98.97% | 91.25% | 11.37 | 2.57% | 0.83% | 68.96%

SSP, single-stranded library preparation protocol. DSP, double-stranded library preparation protocol. † Sample has been previously published (J. O. Kitzman et al., Science Translational Medicine (2012)).

Table 5 tabulates sequencing-related statistics, including the total number of fragments sequenced, read lengths, the percentage of such fragments aligning to the reference with and without a mapping quality threshold, mean coverage, duplication rate, and the proportion of sequenced fragments in two length bins, for each sample. Fragment length was inferred from alignment of paired-end reads. Due to the short read lengths, coverage was calculated by assuming the entire fragment had been read. The estimated number of duplicate fragments is based on fragment endpoints, which may overestimate the true duplication rate in the presence of highly stereotyped cleavage.
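The coverage and duplicate estimates described for Table 5 can be illustrated with a minimal Python sketch. It assumes fragments are supplied as (chromosome, start, end) tuples with inclusive coordinates and that the genome size is passed in by the caller; the function name and data layout are hypothetical and are not taken from the disclosure.

```python
from collections import Counter


def coverage_and_duplicate_rate(fragments, genome_size):
    """Estimate mean coverage and endpoint-based duplicate rate.

    fragments   -- list of (chrom, start, end) tuples, inclusive coordinates
                   (assumed layout, for illustration only)
    genome_size -- number of reference bases used as the denominator
    """
    # Coverage assumes the entire inferred fragment was read,
    # not just the short sequenced ends.
    total_bases = sum(end - start + 1 for _, start, end in fragments)

    # Fragments sharing both endpoints are counted as duplicates of the
    # first occurrence; with stereotyped cleavage this can overestimate.
    counts = Counter(fragments)
    duplicates = sum(n - 1 for n in counts.values())

    return total_bases / genome_size, duplicates / len(fragments)
```

Because duplicates are defined purely by shared endpoints, two independent molecules cut at the same two positions are indistinguishable from a PCR duplicate in this sketch, which is the overestimation the text warns about.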

As described above, FFT was performed on the long fragment WPS values across gene bodies, and the average intensity in the 193-199 bp frequency range was correlated against the same 76 expression datasets for human cell lines and primary tissues. In contrast with the three samples from healthy individuals from Example 4 (where all of the top 10, and nearly all of the top 20, correlations were to lymphoid or myeloid lineages), many of the most highly ranked cell lines or tissues represent non-hematopoietic lineages, in some cases aligning with the cancer type (FIG. 61; Table 3). For example, for IC17, where the patient had a hepatocellular carcinoma, the top-ranked correlation was with HepG2, a hepatocellular carcinoma cell line. For IC35, where the patient had a ductal carcinoma in situ breast cancer, the top-ranked correlation was with MCF7, a metastatic breast adenocarcinoma cell line. In other cases, the cell lines or primary tissues that exhibit the greatest change in correlation rank aligned with the cancer type. For example, for IC15, where the patient had small cell lung cancer, the largest change in correlation rank (−31) was for a small cell lung cancer cell line (SCLC-21H). For IC20 (a lung squamous cell carcinoma) and IC37 (a colorectal adenocarcinoma), there were many non-hematopoietic cancer cell lines displacing the lymphoid/myeloid cell lines in terms of correlation rank, but the alignment of these to the specific cancer type was less clear. It is possible that the specific molecular profile of these cancers was not well-represented amongst the 76 expression datasets (e.g., none of these are lung squamous cell carcinomas; CACO-2 is a cell line derived from a colorectal adenocarcinoma, but is known to be highly heterogeneous).

A greedy, iterative approach was used to estimate the proportions of various cell-types and/or tissues contributing to cfDNA derived from the biological sample. First, the cell-type or tissue whose reference map (here, defined by the 76 RNA expression datasets) had the highest correlation with the average FFT intensity in the 193-199 bp frequency of the WPS long fragment values across gene bodies for a given cfDNA sample was identified. Next, a series of “two tissue” linear mixture models were fitted, including the cell-type or tissue with the highest correlation as well as each of the other remaining cell-types or tissues from the full set of reference maps. Of the latter set, the cell-type or tissue with the highest coefficient was retained as contributory, unless the coefficient was below 1%, in which case the procedure was terminated and this last tissue or cell-type not included. This procedure was repeated, i.e. “three-tissue”, “four-tissue”, and so on, until termination based on the newly added tissue being estimated by the mixture model to contribute less than 1%. The mixture model takes the form:

argmax_{a,b,c,...} cor(Mean_FFTintensity_193-199, a*log2ExpTissue1 + b*log2ExpTissue2 + c*log2ExpTissue3 + ... + (1-a-b-c-...)*log2ExpTissueN).

For example, for IC17, a cfDNA sample derived from a patient with advanced hepatocellular carcinoma, this procedure predicted 9 contributory cell types, including Hep_G2 (28.6%), HMC.1 (14.3%), REH (14.0%), MCF7 (12.6%), AN3.CA (10.7%), THP.1 (7.4%), NB.4 (5.5%), U.266.84 (4.5%), and U.937 (2.4%). For BH01, a cfDNA sample corresponding to a mixture of healthy individuals, this procedure predicted 7 contributory cell types or tissues, including bone marrow (30.0%), NB.4 (19.6%), HMC.1 (13.9%), U.937 (13.4%), U.266.84 (12.5%), Karpas.707 (6.5%), and REH (4.2%). Of note, for IC17, the sample derived from a cancer patient, the highest proportion of predicted contribution corresponds to a cell line that is closely associated with the cancer type that is present in the patient from whom this cfDNA was derived (Hep_G2 and hepatocellular carcinoma). In contrast, for BH01, this approach predicts contributions corresponding only to tissues or cell types that are primarily associated with hematopoiesis, the predominant source of plasma cfDNA in healthy individuals.
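The greedy procedure can be sketched in Python as follows. This is an illustrative simplification, not the implementation used for the reported proportions: it assumes each reference map is a list of log2 expression values aligned gene-by-gene with the sample signal, it maximizes Pearson correlation exactly as the formula is written (so the signal must be passed with whatever sign convention makes a better fit a larger correlation), and at each step it holds the relative proportions of previously selected tissues fixed and searches only the new tissue's weight on a coarse grid rather than refitting every coefficient. The names `greedy_mixture`, `refs`, and `grid` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def greedy_mixture(signal, refs, min_weight=0.01, grid=100):
    """Greedily add reference maps until the next one contributes < min_weight.

    signal -- per-gene sample values (e.g. sign-adjusted mean FFT intensity)
    refs   -- dict: tissue name -> per-gene log2 expression values
    """
    # Step 1: the single reference with the highest correlation.
    first = max(refs, key=lambda name: pearson(signal, refs[name]))
    weights = {first: 1.0}
    mix = list(refs[first])

    while len(weights) < len(refs):
        best_w, best_name = -1.0, None
        for name, ref in refs.items():
            if name in weights:
                continue
            # Simplification: grid-search only the new tissue's weight w,
            # blending it with the current mixture as (1 - w) * mix + w * ref.
            w = max(
                (i / grid for i in range(grid)),
                key=lambda w: pearson(
                    signal, [(1 - w) * m + w * r for m, r in zip(mix, ref)]
                ),
            )
            if w > best_w:
                best_w, best_name = w, name
        # Terminate when the best new tissue contributes less than 1%.
        if best_w < min_weight:
            break
        weights = {k: v * (1 - best_w) for k, v in weights.items()}
        weights[best_name] = best_w
        mix = [(1 - best_w) * m + best_w * r for m, r in zip(mix, refs[best_name])]
    return weights
```

A full refit of all coefficients at each step, as the mixture-model formula implies, would require a constrained optimizer; the fixed-proportion shortcut here keeps the sketch dependency-free at the cost of possibly different estimates.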

Example 6 General Methods for Examples 4-5

Samples

Bulk human peripheral blood plasma, containing contributions from an unknown number of healthy individuals, was obtained from STEMCELL Technologies (Vancouver, British Columbia, Canada) and stored in 2 ml aliquots at −80° C. until use. Individual human peripheral blood plasma from anonymous, healthy donors was obtained from Conversant Bio (Huntsville, Ala., USA) and stored in 0.5 ml aliquots at −80° C. until use.

Whole blood from pregnant women IP01 and IP02 was obtained at 18 and 13 gestational weeks, respectively, and processed as previously described (41).

Human peripheral blood plasma for 52 individuals with clinical diagnosis of Stage IV cancer (Table 4) was obtained from Conversant Bio or PlasmaLab International (Everett, Wash., USA) and stored in 0.5 ml or 1 ml aliquots at −80° C. until use. Human peripheral blood plasma for four individuals with clinical diagnosis of systemic lupus erythematosus was obtained from Conversant Bio and stored in 0.5 ml aliquots at −80° C. until use.

Processing of Plasma Samples

Frozen plasma aliquots were thawed on the bench-top immediately before use. Circulating cell-free DNA was purified from 2 ml of each plasma sample with the QiaAMP Circulating Nucleic Acids kit (Qiagen) as per the manufacturer's protocol. DNA was quantified with a Qubit fluorometer (Invitrogen). To verify cfDNA yield in a subset of samples, purified DNA was further quantified with a custom qPCR assay targeting a multicopy human Alu sequence; the two estimates were found to be concordant.

Preparation of Double-Stranded Sequencing Libraries

Barcoded sequencing libraries were prepared with the ThruPLEX-FD or ThruPLEX DNA-seq 48 D kits (Rubicon Genomics), comprising a proprietary series of end-repair, ligation, and amplification reactions. Between 0.5 ng and 30.0 ng of cfDNA were used as input for all clinical sample libraries. Library amplification for all samples was monitored by real-time PCR to avoid over-amplification, and was typically terminated after 4-6 cycles.

Preparation of Single-Stranded Sequencing Libraries

Adapter 2 was prepared by combining 4.5 μl TE (pH 8), 0.5 μl 1M NaCl, 10 μl 500 μM oligo Adapter2.1, and 10 μl 500 μM oligo Adapter2.2, incubating at 95° C. for 10 seconds, and decreasing the temperature to 14° C. at a rate of 0.1° C/s. Purified cfDNA fragments were dephosphorylated by combining 2× CircLigase II buffer (Epicentre), 5 mM MnCl₂, and 1 U FastAP alkaline phosphatase (Thermo Fisher) with 0.5-10 ng fragments in a 20 μl reaction volume and incubating at 37° C. for 30 minutes. Fragments were then denatured by heating to 95° C. for 3 minutes, and were immediately transferred to an ice bath. The reaction was supplemented with biotin-conjugated adapter oligo CL78 (5 pmol), 20% PEG-6000 (w/v), and 200 U CircLigase II (Epicentre) for a total volume of 40 μl, and was incubated overnight with rotation at 60° C., heated to 95° C. for 3 minutes, and placed in an ice bath. For each sample, 20 μl MyOne C1 beads (Life Technologies) were twice washed in bead binding buffer (BBB) (10 mM Tris-HCl [pH 8], 1M NaCl, 1 mM EDTA [pH 8], 0.05% Tween-20, and 0.5% SDS), and resuspended in 250 μl BBB. Adapter-ligated fragments were bound to the beads by rotating for 60 minutes at room temperature. Beads were collected on a magnetic rack and the supernatant was discarded. Beads were washed once with 500 μl wash buffer A (WBA) (10 mM Tris-HCl [pH 8], 1 mM EDTA [pH 8], 0.05% Tween-20, 100 mM NaCl, 0.5% SDS) and once with 500 μl wash buffer B (WBB) (10 mM Tris-HCl [pH 8], 1 mM EDTA [pH 8], 0.05% Tween-20, 100 mM NaCl). Beads were combined with 1× Isothermal Amplification Buffer (NEB), 2.5 μM oligo CL9, 250 μM (each) dNTPs, and 24 U Bst 2.0 DNA Polymerase (NEB) in a reaction volume of 50 μl, incubated with gentle shaking by ramping temperature from 15° C. to 37° C. at 1° C./minute, and held at 37° C. for 10 minutes. After collection on a magnetic rack, beads were washed once with 200 μl WBA, resuspended in 200 μl of stringency wash buffer (SWB) (0.1×SSC, 0.1% SDS), and incubated at 45° C. for 3 minutes. Beads were again collected and washed once with 200 μl WBB. Beads were then combined with 1× CutSmart Buffer (NEB), 0.025% Tween-20, 100 μM (each) dNTPs, and 5 U T4 DNA Polymerase (NEB) and incubated with gentle shaking for 30 minutes at room temperature. Beads were washed once with each of WBA, SWB, and WBB as described above. Beads were then mixed with 1× CutSmart Buffer (NEB), 5% PEG-6000, 0.025% Tween-20, 2 μM double-stranded adapter 2, and 10 U T4 DNA Ligase (NEB), and incubated with gentle shaking for 2 hours at room temperature. Beads were washed once with each of WBA, SWB, and WBB as described above, and resuspended in 25 μl TET buffer (10 mM Tris-HCl [pH 8], 1 mM EDTA [pH 8], 0.05% Tween-20). Second strands were eluted from beads by heating to 95° C., collecting beads on a magnetic rack, and transferring the supernatant to a new tube. Library amplification for all samples was monitored by real-time PCR to avoid over-amplification, and required an average of 4 to 6 cycles per library.

Sequencing

All libraries were sequenced on HiSeq 2000 or NextSeq 500 instruments (Illumina).

Primary Sequencing Data Processing

Barcoded paired end (PE) Illumina sequencing data was split allowing up to one substitution in the barcode sequence. Reads shorter than or equal to the read length were consensus called and adapter trimmed. Remaining consensus single end reads (SR) and the individual PE reads were aligned to the human reference genome sequence (GRCh37, 1000 Genomes phase 2 technical reference downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/) using the ALN algorithm implemented in BWA v0.7.10. PE reads were further processed with BWA SAMPE to resolve ambiguous placement of read pairs or to rescue missing alignments by a more sensitive alignment step around the location of one placed read end. Aligned SR and PE data was directly converted to sorted BAM format using the SAMtools API. BAM files of the sample were merged across lanes and sequencing runs.
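The barcode-splitting step, which tolerates a single substitution, can be sketched as below. The function names are hypothetical, and the decision to leave a read unassigned when more than one known barcode is within the allowed distance is an assumption of this sketch rather than a detail given in the text.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, known_barcodes, max_substitutions=1):
    """Return the unique known barcode within max_substitutions, else None."""
    hits = [
        bc
        for bc in known_barcodes
        if len(bc) == len(observed) and hamming(observed, bc) <= max_substitutions
    ]
    # Assumption: ambiguous matches are treated as unassignable.
    return hits[0] if len(hits) == 1 else None
```

Only substitutions are considered, so a barcode read with an insertion or deletion is not rescued by this sketch.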

Quality control was performed using FastQC (v0.11.2), obtaining a library complexity estimate (Picard tools v1.113), determining the proportion of adapter dimers, analyzing the inferred library insert size and the nucleotide and dinucleotide frequencies at the outer read ends, and checking the mapping quality distributions of each library.

Simulated Read Data Sets

Aligned sequencing data was simulated (SR if shorter than 45 bp, PE 45 bp otherwise) for all major chromosomes of the human reference (GRCh37). For this purpose, dinucleotide frequencies were determined from real data on both read ends and both strand orientations. Dinucleotide frequencies were also recorded for the reference genome on both strands. Further, the insert size distribution of the real data was extracted for the 1-500 bp range. Reads were simulated by iterating through the sequence of the major reference chromosomes. At each step (i.e., one or more times at each position depending on desired coverage), (1) the strand is randomly chosen, (2) the ratio of the dinucleotide frequency in the real data over the frequency in the reference sequence is used to randomly decide whether the initiating dinucleotide is considered, (3) an insert size is sampled from the provided insert-size distribution and (4) the frequency ratio of the terminal dinucleotide is used to randomly decide whether the generated alignment is reported. The simulated coverage was matched to that of the original data after PCR duplicate removal.

Coverage, Read Starts and Window Protection Scores

The data of the present disclosure provides information about the two physical ends of DNA molecules used in sequencing library preparation. We extract this information using the SAMtools application programming interface (API) from BAM files. As read starts, we use both outer alignment coordinates of PE data for which both reads aligned to the same chromosome and where reads have opposite orientations. In cases where PE data was converted to single read data by adapter trimming, we consider both end coordinates of the SR alignment as read starts. For coverage, we consider all positions between the two (inferred) molecule ends, including these end positions. We define windowed protection scores (WPS) of a window size k as the number of molecules spanning a window minus those starting at any bases encompassed by the window. We assign the determined WPS to the center of the window. For molecules in the 35-80 bp range (short fraction), we use a window size of 16 and, for molecules in the 120-180 bp (long fraction), we use a window size of 120.
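The WPS definition can be made concrete with a short Python sketch. It assumes fragments are given as (start, end) pairs with inclusive coordinates on a single chromosome, and it counts a fragment with either endpoint inside the window once, as a single negative contribution; both the data layout and that counting convention are assumptions of this sketch, and the function name is hypothetical.

```python
def wps(fragments, center, k, min_len, max_len):
    """Windowed protection score for a window of size k centered at `center`.

    fragments        -- iterable of (start, end), inclusive coordinates
    min_len, max_len -- fragment length bin (e.g. 120-180 with k=120,
                        or 35-80 with k=16)
    """
    win_start = center - k // 2
    win_end = win_start + k - 1  # inclusive window end
    score = 0
    for start, end in fragments:
        length = end - start + 1
        if not (min_len <= length <= max_len):
            continue  # fragment belongs to a different size fraction
        if start < win_start and end > win_end:
            score += 1  # fragment fully spans the window
        elif win_start <= start <= win_end or win_start <= end <= win_end:
            score -= 1  # fragment has an endpoint inside the window
    return score
```

For example, `wps(fragments, 200, 120, 120, 180)` scores the long fraction at position 200, and `wps(fragments, 220, 16, 35, 80)` scores the short fraction at position 220. A genome-wide track would evaluate this at every position, which in practice calls for an interval index rather than the linear scan shown here.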

Nucleosome Peak Calling

Local maxima of nucleosome protection are called from the long fraction WPS, which we locally adjust to a running median of zero (1 kb window) and smooth using a Savitzky-Golay filter (window size 21, 2nd order polynomial). The WPS track is then segmented into above zero regions (allowing up to 5 consecutive positions below zero). If the resulting region is between 50-150 bp long, we identify the median value of that region and search for the maximum-sum contiguous window above the median. We report the start, end and center coordinates of this window. Peak-to-peak distances, etc., are calculated from the center coordinates. The score of the call is determined as the distance between the maximum value in the window and the average of the two adjacent WPS minima neighboring the region. If the identified region is 150-450 bp long, we apply the same above median contiguous window approach, but only report those windows that are between 50-150 bp in size. For score calculation of multiple windows derived from the 150-450 bp regions, we assume the neighboring minima within the region to be zero. We discard regions shorter than 50 bp and longer than 450 bp.
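The segmentation and window-selection steps can be sketched as follows. This sketch covers only part of the procedure: it assumes the WPS track has already been median-adjusted and smoothed, it handles only the 50-150 bp regions, and it omits score calculation. Interpreting the "maximum-sum contiguous window above the median" as a maximum-subarray search on median-subtracted values is an assumption of this sketch, and all function names are hypothetical.

```python
from statistics import median


def above_zero_regions(track, max_gap=5):
    """Half-open (start, end) index ranges of above-zero runs,
    bridging up to max_gap consecutive positions at or below zero."""
    regions, start, last_pos = [], None, None
    for i, value in enumerate(track):
        if value > 0:
            if start is None:
                start = i
            elif i - last_pos - 1 > max_gap:
                regions.append((start, last_pos + 1))
                start = i
            last_pos = i
    if start is not None:
        regions.append((start, last_pos + 1))
    return regions


def best_window(values):
    """Maximum-sum contiguous window of (value - median), half-open indices."""
    med = median(values)
    best_sum, best = float("-inf"), (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, value in enumerate(values):
        if cur_sum <= 0:
            cur_sum, cur_start = 0.0, i  # restart the running window
        cur_sum += value - med
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


def call_peaks(track, min_len=50, max_len=150):
    """Return (start, end, center) for each 50-150 bp above-zero region."""
    peaks = []
    for s, e in above_zero_regions(track):
        if min_len <= e - s <= max_len:
            ws, we = best_window(track[s:e])
            peaks.append((s + ws, s + we, s + (ws + we - 1) // 2))
    return peaks
```

Regions of 150-450 bp would need the additional rule of reporting only 50-150 bp windows, which is left out here to keep the example short.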

Dinucleotide Composition of 167 bp Fragments

Fragments with inferred lengths of exactly 167 bp, corresponding to the dominant peak of the fragment size distribution, were filtered within samples to remove duplicates. Dinucleotide frequencies were calculated in a strand-aware manner, using a sliding 2 bp window and reference alleles at each position, beginning 50 bp upstream of one fragment endpoint and ending 50 bp downstream of the other endpoint. Observed dinucleotide frequencies at each position were compared to expected dinucleotide frequencies determined from a set of simulated reads reflecting the same cleavage biases calculated in a library-specific manner (see above for details).

WPS Profiles Surrounding Transcription Factor Binding Sites and Genomic Features

Analysis began with an initial set of clustered FIMO (motif-based) intervals defining a set of computationally predicted transcription factor binding sites. For a subset of clustered transcription factors (AP-2-2, AP-2, CTCF_Core-2, E2F-2, EBF1, Ebox-CACCTG, Ebox, ESR1, ETS, IRF-2, IRF-3, IRF, MAFK, MEF2A-2, MEF2A, MYC-MAX, PAX5-2, RUNX2, RUNX-AML, STAF-2, TCF-LEF, YY1), the set of sites was refined to a more confident set of actively bound transcription factor binding sites based on experimental data. For this purpose, only predicted binding sites that overlap with peaks defined by ChIP-seq experiments from publicly available ENCODE data (TfbsClusteredV3 set downloaded from UCSC) were retained.

Windowed protection scores surrounding these sites were extracted for both the CH01 sample and the corresponding simulation. A protection score for each site/feature was calculated at each position relative to the start coordinate of each binding site and then aggregated. Plots of CTCF binding sites were shifted such that the zero coordinate on the x-axis is at the center of the known 52 bp binding footprint of CTCF. The mean of the first and last 500 bp (which is predominantly flat and represents a mean offset) of the 5 kb extracted WPS signal was then subtracted from the original signal. For long fragment signal only, a sliding window mean was calculated using a 200 bp window and subtracted from the original signal. Finally, the corrected WPS profile for the simulation was subtracted from the corrected WPS profile for CH01 to correct for signal that was a product of fragment length and ligation bias. This final profile was plotted and termed the “Adjusted WPS”.

Genomic features, such as transcription start sites, transcription end sites, start codons, splice donor, and splice acceptor sites were obtained from Ensembl Build version 75. Adjusted WPS surrounding these features was calculated and plotted as described above for transcription factor binding sites.

Analysis of Nucleosome Spacing Around CTCF Binding Sites and Corresponding WPS

CTCF sites used for this analysis first included clustered FIMO predictions of CTCF binding sites (computationally predicted via motifs). We then created two additional subsets of this set: 1) intersection with the set of CTCF ChIP-seq peaks available through the ENCODE TfbsClusteredV3 (see above), and 2) intersection with a set of CTCF sites that are experimentally observed to be active across 19 tissues.

The positions of 10 nucleosomes on either side of the binding site were extracted for each site. We calculated distances between all adjacent nucleosomes to obtain a distribution of inter-nucleosome distances for each set of sites. The distribution of −1 to +1 nucleosome spacing changed substantially, shifting to larger spacing, particularly in the 230-270 bp range. This suggested that truly active CTCF sites largely shift towards wider spacing between the −1 and +1 nucleosomes, and that a difference in WPS for both long and short read fractions might therefore be apparent. Therefore, the mean short and long fragment WPS at each position relative to the center of CTCF sites were additionally calculated. To explore the effect of nucleosome spacing, this mean was taken within bins of −1 to +1 nucleosome spacing of less than 160, 160-200, 200-230, 230-270, 270-420, 420-460, and greater than 460 bp. These intervals approximately captured spacings of interest, such as the dominant peak and the emerging peak at 230-270 bp for more confidently active sites.

Analysis of DNase I Hypersensitive Sites (DHS)

DHS peaks for 349 primary tissue and cell line samples in BED format by Maurano et al. (Science, vol. 337(6099), pp. 1190-95 (2012); “all_fdr0.05_hot” file, last modified Feb. 13, 2012) were downloaded from the University of Washington Encode database. Samples derived from fetal tissues, comprising 233 of these peak sets, were removed from the analysis as they behaved inconsistently within tissue type, possibly because of unequal representation of multiple cell types within each tissue sample. 116 samples representing a variety of cell lineages were retained for analysis. For the midpoint of each DHS peak in a particular set, the nearest upstream and downstream calls in the CH01 callset were identified, and the genomic distance between the centers of those two calls was calculated. The distribution of all such distances was visualized for each DHS peak callset using a smoothed density estimate calculated for distances between 0 and 500 bp.
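The distance between the two nucleosome calls flanking a DHS midpoint can be sketched with a binary search. The sketch assumes the call centers for one chromosome are held in a sorted list and treats a call centered exactly on the midpoint as the upstream call; both choices, and the function name, are assumptions made for illustration.

```python
from bisect import bisect_right


def flanking_call_distance(midpoint, call_centers):
    """Distance between the nearest call centers upstream and downstream
    of `midpoint`. `call_centers` must be sorted ascending.
    Returns None if the midpoint is not flanked on both sides."""
    i = bisect_right(call_centers, midpoint)  # first center > midpoint
    if i == 0 or i == len(call_centers):
        return None
    return call_centers[i] - call_centers[i - 1]
```

Collecting this value for every DHS midpoint in a peak set, and keeping those between 0 and 500 bp, gives the sample from which a density estimate can be drawn.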

Gene Expression Analysis

FPKM expression values, measured for 20,344 Ensembl gene identifiers in 44 human cell lines and 32 primary tissues by the Human Protein Atlas (“ma.csv” file), were used in this study. For analyses across tissues, genes with fewer than 3 non-zero expression values were excluded (19,378 genes passing this filter). The expression data set was provided with one decimal precision for the FPKM values. Thus, a zero expression value (0.0) indicates expression between 0 and a value less than 0.05. Unless otherwise noted, the minimum expression value was set to 0.04 FPKM before log₂-transformation of the expression values.
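The filtering and transformation described above amount to a few lines of Python. The sketch assumes expression is held as a dictionary mapping a gene identifier to its list of FPKM values across samples; that layout and the function name are hypothetical.

```python
from math import log2


def prepare_expression(fpkm_by_gene, min_nonzero=3, floor=0.04):
    """Drop genes with fewer than min_nonzero non-zero FPKM values, then
    floor values at `floor` FPKM and log2-transform them."""
    prepared = {}
    for gene, values in fpkm_by_gene.items():
        if sum(v > 0 for v in values) < min_nonzero:
            continue  # too few samples with detectable expression
        prepared[gene] = [log2(max(v, floor)) for v in values]
    return prepared
```

Because the source values carry one decimal place, the smallest non-zero value is 0.1 FPKM, so the 0.04 floor only affects entries reported as 0.0 and keeps them just below the rounding threshold of 0.05.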

Smooth Periodograms and Smoothing of Trajectories

The long fragment WPS was used to calculate periodograms of genomicregions using Fast Fourier Transform (FFT, spec.pgram in the Rstatistical programming environment) with frequencies between 1/500bases and 1/100 bases. Parameters to smooth (3 bp Daniell smoother;moving average giving half weight to the end values) and de-trend thedata (i.e. subtract the mean of the series and remove a linear trend)are optionally additionally used.

Where indicated, the recursive time series filter as implemented in the R statistical programming environment was used to remove high frequency variation from trajectories. 24 filter frequencies (1/seq(5,100,4)) were used, and the first 24 values of the trajectory were used as initial values. Adjustments for the 24-value shift in the resulting trajectories were made by repeating the last 24 values of the trajectory.
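A recursive (autoregressive) filter of the kind referred to above can be sketched in a few lines. The recurrence shown, in which each output is the input plus a weighted sum of earlier outputs, is the general form of a recursive filter; the function name, the ordering of the initial values (most recent first), and the omission of the 24-value shift adjustment are assumptions, so this should not be read as a reproduction of the original analysis.

```python
def recursive_filter(x, coeffs, init):
    """Apply y[i] = x[i] + sum_j coeffs[j] * y[i - 1 - j].

    init holds the len(coeffs) values that precede x[0], most recent first
    (an assumed ordering).
    """
    if len(init) != len(coeffs):
        raise ValueError("init must supply one value per filter coefficient")
    history = list(init)  # history[0] is the most recent output
    out = []
    for v in x:
        y = v + sum(c * h for c, h in zip(coeffs, history))
        out.append(y)
        history = [y] + history[:-1]
    return out

# Python equivalent of the R expression 1/seq(5,100,4): 1/5, 1/9, ..., 1/97
FILTER_COEFFS = [1 / d for d in range(5, 100, 4)]
```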

Correlation of FFT Intensities and Expression Values

The intensity values determined from smooth periodograms (FFT) for the 120-280 bp range were analyzed in the context of gene expression. An S-shaped Pearson correlation between gene expression values and FFT intensities around the major inter-nucleosome distance peak was observed. A pronounced negative correlation was observed in the 193-199 bp range. As a result, the intensities in this frequency range were averaged and correlated with log₂-transformed expression values.
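The averaging and correlation described above might look like the following sketch. It assumes each gene's periodogram is available as a list of (period in bp, intensity) pairs and that the log₂ expression values are given in the same gene order; both the data layout and the function names are hypothetical.

```python
import math

def band_mean(periodogram, lo=193, hi=199):
    """Average the intensities whose period lies in [lo, hi] bp."""
    vals = [inten for period, inten in periodogram if lo <= period <= hi]
    if not vals:
        raise ValueError("no periodogram entries in the requested band")
    return sum(vals) / len(vals)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Computing `band_mean` per gene and passing the resulting list together with the log₂ expression values to `pearson` gives one correlation per expression data set.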

FURTHER EXAMPLES

Example 7

A method of determining tissues and/or cell types giving rise to cell free DNA (cfDNA) in a subject, the method comprising:

isolating cfDNA from a biological sample from the subject, the isolated cfDNA comprising a plurality of cfDNA fragments;

determining a sequence associated with at least a portion of the plurality of cfDNA fragments;

determining a genomic location within a reference genome for at least some cfDNA fragment endpoints of the plurality of cfDNA fragments as a function of the cfDNA fragment sequences; and

determining at least some of the tissues and/or cell types giving rise to the cfDNA fragments as a function of the genomic locations of at least some of the cfDNA fragment endpoints.

Example 8

The method of Example 7 wherein the step of determining at least some of the tissues and/or cell types giving rise to the cfDNA fragments comprises comparing the genomic locations of at least some of the cfDNA fragment endpoints to one or more reference maps.

Example 9

The method of Example 7 or Example 8 wherein the step of determining at least some of the tissues and/or cell types giving rise to the cfDNA fragments comprises performing a mathematical transformation on a distribution of the genomic locations of at least some of the cfDNA fragment endpoints.

Example 10

The method of Example 9 wherein the mathematical transformation includes a Fourier transformation.

Example 11

The method of any preceding Example further comprising determining a score for each of at least some coordinates of the reference genome, wherein the score is determined as a function of at least the plurality of cfDNA fragment endpoints and their genomic locations, and wherein the step of determining at least some of the tissues and/or cell types giving rise to the observed cfDNA fragments comprises comparing the scores to one or more reference maps.

Example 12

The method of Example 11, wherein the score for a coordinate represents or is related to the probability that the coordinate is a location of a cfDNA fragment endpoint.
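One simple way to obtain a per-coordinate score of the kind described in Examples 11 and 12 is to count how often each reference coordinate is observed as a fragment endpoint and normalize by the total number of endpoints. The sketch below does this for fragments given as (chromosome, start, end) tuples with inclusive coordinates; that layout, the function names, and the use of a plain empirical frequency as the score are assumptions chosen for illustration, and other scoring functions would fit the same description.

```python
from collections import Counter

def endpoint_counts(fragments):
    """Count fragment endpoints per (chromosome, position).

    fragments: iterable of (chrom, start, end) with inclusive coordinates
               of each aligned fragment (hypothetical layout).
    """
    counts = Counter()
    for chrom, start, end in fragments:
        counts[(chrom, start)] += 1
        counts[(chrom, end)] += 1
    return counts

def endpoint_scores(fragments):
    """Empirical frequency with which each coordinate is a fragment endpoint."""
    counts = endpoint_counts(fragments)
    total = sum(counts.values())
    return {coord: c / total for coord, c in counts.items()}
```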

Example 13

The method of any one of Examples 8 to 12 wherein the reference map comprises a DNase I hypersensitive site map generated from at least one cell-type or tissue.

Example 14

The method of any one of Examples 8 to 13 wherein the reference map comprises an RNA expression map generated from at least one cell-type or tissue.

Example 15

The method of any one of Examples 8 to 14 wherein the reference map is generated from cfDNA from an animal to which human tissues or cells have been xenografted.

Example 16

The method of any one of Examples 8 to 15 wherein the reference map comprises a chromosome conformation map generated from at least one cell-type or tissue.

Example 17

The method of any one of Examples 8 to 16 wherein the reference map comprises a chromatin accessibility map generated from at least one cell-type or tissue.

Example 18

The method of any one of Examples 8 to 17 wherein the reference map comprises sequence data obtained from samples obtained from at least one reference subject.

Example 19

The method of any one of Examples 8 to 18 wherein the reference map corresponds to at least one cell-type or tissue that is associated with a disease or a disorder.

Example 20

The method of any one of Examples 8 to 19 wherein the reference map comprises positions or spacing of nucleosomes and/or chromatosomes in a tissue or cell type.

Example 21

The method of any one of Examples 8 to 20 wherein the reference map is generated by digesting chromatin obtained from at least one cell-type or tissue with an exogenous nuclease (e.g., micrococcal nuclease).

Example 22

The method of any one of Examples 8 to 21, wherein the reference maps comprise chromatin accessibility data determined by a transposition-based method (e.g., ATAC-seq) from at least one cell-type or tissue.

Example 23

The method of any one of Examples 8 to 22 wherein the reference maps comprise data associated with positions of a DNA binding and/or DNA occupying protein for a tissue or cell type.

Example 24

The method of Example 23 wherein the DNA binding and/or DNA occupying protein is a transcription factor.

Example 25

The method of Example 23 or Example 24 wherein the positions are determined by chromatin immunoprecipitation of a crosslinked DNA-protein complex.

Example 26

The method of Example 23 or Example 24 wherein the positions are determined by treating DNA associated with the tissue or cell type with a nuclease (e.g., DNase-I).

Example 27

The method of any one of Examples 8 to 26 wherein the reference map comprises a biological feature related to the positions or spacing of nucleosomes, chromatosomes, or other DNA binding or DNA occupying proteins within a tissue or cell type.

Example 28

The method of Example 27 wherein the biological feature is quantitative expression of one or more genes.

Example 29

The method of Example 27 or Example 28 wherein the biological feature is presence or absence of one or more histone marks.

Example 30

The method of any one of Examples 27 to 29 wherein the biological feature is hypersensitivity to nuclease cleavage.

Example 31

The method of any one of Examples 8 to 30 wherein the tissue or cell type used to generate a reference map is a primary tissue from a subject having a disease or disorder.

Example 32

The method of Example 31 wherein the disease or disorder is selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, inflammatory bowel disease, systemic autoimmune disease, localized autoimmune disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.

Example 33

The method of any one of Examples 8 to 30 wherein the tissue or cell type used to generate a reference map is a primary tissue from a healthy subject.

Example 34

The method of any one of Examples 8 to 30 wherein the tissue or cell type used to generate a reference map is an immortalized cell line.

Example 35

The method of any one of Examples 8 to 30 wherein the tissue or cell type used to generate a reference map is a biopsy from a tumor.

Example 36

The method of Example 18 wherein the sequence data comprises positions of cfDNA fragment endpoints.

Example 37

The method of Example 36 wherein the reference subject is healthy.

Example 38

The method of Example 36 wherein the reference subject has a disease or disorder.

Example 39

The method of Example 38 wherein the disease or disorder is selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, inflammatory bowel disease, systemic autoimmune disease, localized autoimmune disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.

Example 40

The method of any one of Examples 19 to 39 wherein the reference map comprises reference scores for at least a portion of coordinates of the reference genome associated with the tissue or cell type.

Example 41

The method of Example 40 wherein the reference map comprises a mathematical transformation of the scores.

Example 42

The method of Example 40 wherein the scores represent a subset of all reference genomic coordinates for the tissue or cell type.

Example 43

The method of Example 42 wherein the subset is associated with positions or spacing of nucleosomes and/or chromatosomes.

Example 44

The method of Example 42 or Example 43 wherein the subset is associated with transcription start sites and/or transcription end sites.

Example 45

The method of any one of Examples 42 to 44 wherein the subset is associated with binding sites of at least one transcription factor.

Example 46

The method of any one of Examples 42 to 45 wherein the subset is associated with nuclease hypersensitive sites.

Example 47

The method of any one of Examples 40 to 46 wherein the subset is additionally associated with at least one orthogonal biological feature.

Example 48

The method of Example 47 wherein the orthogonal biological feature is associated with high expression genes.

Example 49

The method of Example 47 wherein the orthogonal biological feature is associated with low expression genes.

Example 50

The method of any one of Examples 41 to 49 wherein the mathematical transformation includes a Fourier transformation.

Example 51

The method of any one of Examples 11 to 50 wherein at least a subset of the plurality of the scores has a score above a threshold value.

Example 52

The method of any one of Examples 7 to 51 wherein the step of determining the tissues and/or cell types giving rise to the cfDNA as a function of a plurality of the genomic locations of at least some of the cfDNA fragment endpoints comprises comparing a Fourier transform of the plurality of the genomic locations of at least some of the cfDNA fragment endpoints, or a mathematical transformation thereof, with a reference map.

Example 53

The method of any preceding Example further comprising generating a report comprising a list of the determined tissues and/or cell types giving rise to the isolated cfDNA.

Example 54

A method of identifying a disease or disorder in a subject, the method comprising:

isolating cell free DNA (cfDNA) from a biological sample from the subject, the isolated cfDNA comprising a plurality of cfDNA fragments;

determining a sequence associated with at least a portion of the plurality of cfDNA fragments;

determining a genomic location within a reference genome for at least some cfDNA fragment endpoints of the plurality of cfDNA fragments as a function of the cfDNA fragment sequences;

determining at least some of the tissues and/or cell types giving rise to the cfDNA as a function of the genomic locations of at least some of the cfDNA fragment endpoints; and

identifying the disease or disorder as a function of the determined tissues and/or cell types giving rise to the cfDNA.

Example 55

The method of Example 54 wherein the step of determining the tissues and/or cell types giving rise to the cfDNA comprises comparing the genomic locations of at least some of the cfDNA fragment endpoints to one or more reference maps.

Example 56

The method of Example 54 or Example 55 wherein the step of determining the tissues and/or cell types giving rise to the cfDNA comprises performing a mathematical transformation on a distribution of the genomic locations of at least some of the plurality of the cfDNA fragment endpoints.

Example 57

The method of Example 56 wherein the mathematical transformation includes a Fourier transformation.

Example 58

The method of any one of Examples 54 to 57 further comprising determining a score for each of at least some coordinates of the reference genome, wherein the score is determined as a function of at least the plurality of cfDNA fragment endpoints and their genomic locations, and wherein the step of determining at least some of the tissues and/or cell types giving rise to the observed cfDNA fragments comprises comparing the scores to one or more reference maps.

Example 59

The method of Example 58, wherein the score for a coordinate represents or is related to the probability that the coordinate is a location of a cfDNA fragment endpoint.

Example 60

The method of any one of Examples 55 to 59 wherein the reference map comprises a DNase I hypersensitive site map, an RNA expression map, expression data, a chromosome conformation map, a chromatin accessibility map, chromatin fragmentation map, or sequence data obtained from samples obtained from at least one reference subject, and corresponding to at least one cell type or tissue that is associated with a disease or a disorder, and/or positions or spacing of nucleosomes and/or chromatosomes in a tissue or cell type.

Example 61

The method of any one of Examples 55 to 60 wherein the reference map is generated by digesting chromatin from at least one cell-type or tissue with an exogenous nuclease (e.g., micrococcal nuclease).

Example 62

The method of Example 60 or Example 61, wherein the reference maps comprise chromatin accessibility data determined by applying a transposition-based method (e.g., ATAC-seq) to nuclei or chromatin from at least one cell-type or tissue.

Example 63

The method of any one of Examples 55 to 62 wherein the reference maps comprise data associated with positions of a DNA binding and/or DNA occupying protein for a tissue or cell type.

Example 64

The method of Example 63 wherein the DNA binding and/or DNA occupying protein is a transcription factor.

Example 65

The method of Example 63 or Example 64 wherein the positions are determined by applying chromatin immunoprecipitation of a crosslinked DNA-protein complex to at least one cell-type or tissue.

Example 66

The method of Example 63 or Example 64 wherein the positions are determined by treating DNA associated with the tissue or cell type with a nuclease (e.g., DNase-I).

Example 67

The method of any one of Examples 54 to 66 wherein the reference map comprises a biological feature related to the positions or spacing of nucleosomes, chromatosomes, or other DNA binding or DNA occupying proteins within a tissue or cell type.

Example 68

The method of Example 67 wherein the biological feature is quantitative expression of one or more genes.

Example 69

The method of Example 67 or Example 68 wherein the biological feature is presence or absence of one or more histone marks.

Example 70

The method of any one of Examples 67 to 69 wherein the biological feature is hypersensitivity to nuclease cleavage.

Example 71

The method of any one of Examples 55 to 70 wherein the tissue or cell type used to generate a reference map is a primary tissue from a subject having a disease or disorder.

Example 72

The method of Example 71 wherein the disease or disorder is selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, inflammatory bowel disease, systemic autoimmune disease, localized autoimmune disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.

Example 73

The method of any one of Examples 55 to 70 wherein the tissue or cell type used to generate a reference map is a primary tissue from a healthy subject.

Example 74

The method of any one of Examples 55 to 70 wherein the tissue or cell type used to generate a reference map is an immortalized cell line.

Example 75

The method of any one of Examples 55 to 70 wherein the tissue or cell type used to generate a reference map is a biopsy from a tumor.

Example 76

The method of Example 60 wherein the sequence data obtained from samples obtained from at least one reference subject comprises positions of cfDNA fragment endpoint probabilities.

Example 77

The method of Example 76 wherein the reference subject is healthy.

Example 78

The method of Example 76 wherein the reference subject has a disease or disorder.

Example 79

The method of Example 78 wherein the disease or disorder is selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, inflammatory bowel disease, systemic autoimmune disease, localized autoimmune disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.

Example 80

The method of any one of Examples 60 to 79 wherein the reference map comprises cfDNA fragment endpoint probabilities for at least a portion of the reference genome associated with the tissue or cell type.

Example 81

The method of Example 80 wherein the reference map comprises a mathematical transformation of the cfDNA fragment endpoint probabilities.

Example 82

The method of Example 80 wherein the cfDNA fragment endpoint probabilities represent a subset of all reference genomic coordinates for the tissue or cell type.

Example 83

The method of Example 82 wherein the subset is associated with positions or spacing of nucleosomes and/or chromatosomes.

Example 84

The method of Example 82 or Example 83 wherein the subset is associated with transcription start sites and/or transcription end sites.

Example 85

The method of any one of Examples 82 to 84 wherein the subset is associated with binding sites of at least one transcription factor.

Example 86

The method of any one of Examples 82 to 85 wherein the subset is associated with nuclease hypersensitive sites.

Example 87

The method of any one of Examples 82 to 86 wherein the subset is additionally associated with at least one orthogonal biological feature.

Example 88

The method of Example 87 wherein the orthogonal biological feature is associated with high expression genes.

Example 89

The method of Example 87 wherein the orthogonal biological feature is associated with low expression genes.

Example 90

The method of any one of Examples 81 to 89 wherein the mathematical transformation includes a Fourier transformation.

Example 91

The method of any one of Examples 58 to 90 wherein at least a subset of the plurality of the cfDNA fragment endpoint scores each has a score above a threshold value.

Example 92

The method of any one of Examples 54 to 91 wherein the step of determining the tissue(s) and/or cell type(s) of the cfDNA as a function of a plurality of the genomic locations of at least some of the cfDNA fragment endpoints comprises comparing a Fourier transform of the plurality of the genomic locations of at least some of the cfDNA fragment endpoints, or a mathematical transformation thereof, with a reference map.

Example 93

The method of any one of Examples 54 to 92 wherein the reference map comprises DNA or chromatin fragmentation data corresponding to at least one tissue that is associated with the disease or disorder.

Example 94

The method of any one of Examples 54 to 93 wherein the reference genome is associated with a human.

Example 95

The method of any one of Examples 54 to 94 further comprising generating a report comprising a statement identifying the disease or disorder.

Example 96

The method of Example 95 wherein the report further comprises a list of the determined tissue(s) and/or cell type(s) of the isolated cfDNA.

Example 97

The method of any preceding Example wherein the biological sample comprises, consists essentially of, or consists of whole blood, peripheral blood plasma, urine, or cerebral spinal fluid.

Example 98

A method for determining tissues and/or cell types giving rise to cell-free DNA (cfDNA) in a subject, comprising:

(i) generating a nucleosome map by obtaining a biological sample from the subject, isolating cfDNA from the biological sample, and measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA;

(ii) generating a reference set of nucleosome maps by obtaining a biological sample from control subjects or subjects with known disease, isolating the cfDNA from the biological sample, measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; and

(iii) determining the tissues and/or cell types giving rise to the cfDNA by comparing the nucleosome map derived from the cfDNA to the reference set of nucleosome maps;

wherein (a), (b) and (c) are:

(a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a cfDNA fragment;

(b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a cfDNA fragment; and

(c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a cfDNA fragment as a consequence of differential nucleosome occupancy.
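Distributions (a), (b) and (c) can be approximated empirically from aligned fragments. The sketch below treats observed frequencies as stand-ins for the likelihoods: (a) the share of all termini falling on each position, (b) the share of fragments with each (start, end) pair, and (c) the share of fragments covering each position. The single-chromosome (start, end) input with inclusive coordinates, the function name, and the use of raw coverage as a proxy for occupancy-driven appearance are all assumptions made for illustration.

```python
from collections import Counter

def fragment_distributions(fragments):
    """Empirical versions of distributions (a), (b) and (c).

    fragments: list of (start, end) inclusive coordinates on one chromosome
               (hypothetical layout).
    Returns three dicts: terminus frequency per position, frequency per
    (start, end) pair, and fraction of fragments covering each position.
    """
    n = len(fragments)
    termini, pairs, coverage = Counter(), Counter(), Counter()
    for start, end in fragments:
        termini[start] += 1
        termini[end] += 1
        pairs[(start, end)] += 1
        for pos in range(start, end + 1):
            coverage[pos] += 1
    dist_a = {pos: c / (2 * n) for pos, c in termini.items()}
    dist_b = {pair: c / n for pair, c in pairs.items()}
    dist_c = {pos: c / n for pos, c in coverage.items()}
    return dist_a, dist_b, dist_c
```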

Example 99

A method for determining tissues and/or cell types giving rise to cell-free DNA in a subject, comprising:

(i) generating a nucleosome map by obtaining a biological sample from the subject, isolating the cfDNA from the biological sample, and measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA;

(ii) generating a reference set of nucleosome maps by obtaining a biological sample from control subjects or subjects with known disease, isolating the cfDNA from the biological sample, measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of DNA derived from digestion of chromatin with micrococcal nuclease (MNase), DNase treatment, or ATAC-Seq; and

(iii) determining the tissues and/or cell types giving rise to the cfDNA by comparing the nucleosome map derived from the cfDNA to the reference set of nucleosome maps;

wherein (a), (b) and (c) are:

(a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a sequenced fragment;

(b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a sequenced fragment; and

(c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a sequenced fragment as a consequence of differential nucleosome occupancy.

Example 100

A method for diagnosing a clinical condition in a subject, comprising:

(i) generating a nucleosome map by obtaining a biological sample from the subject, isolating cfDNA from the biological sample, and measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA;

(ii) generating a reference set of nucleosome maps by obtaining a biological sample from control subjects or subjects with known disease, isolating the cfDNA from the biological sample, measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; and

(iii) determining the clinical condition by comparing the nucleosome map derived from the cfDNA to the reference set of nucleosome maps;

wherein (a), (b) and (c) are:

(a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a cfDNA fragment;

(b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a cfDNA fragment; and

(c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a cfDNA fragment as a consequence of differential nucleosome occupancy.

Example 101

A method for diagnosing a clinical condition in a subject, comprising:

(i) generating a nucleosome map by obtaining a biological sample from the subject, isolating cfDNA from the biological sample, and measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA;

(ii) generating a reference set of nucleosome maps by obtaining a biological sample from control subjects or subjects with known disease, isolating the cfDNA from the biological sample, measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of DNA derived from digestion of chromatin with micrococcal nuclease (MNase), DNase treatment, or ATAC-Seq; and

(iii) determining the tissue-of-origin composition of the cfDNA by comparing the nucleosome map derived from the cfDNA to the reference set of nucleosome maps;

wherein (a), (b) and (c) are:

(a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a sequenced fragment;

(b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a sequenced fragment; and

(c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a sequenced fragment as a consequence of differential nucleosome occupancy.

Example 102

The method of any one of Examples 98-101, wherein the nucleosome map is generated by:

purifying the cfDNA isolated from the biological sample;

constructing a library by adaptor ligation and optionally PCR amplification; and

sequencing the resulting library.

Example 103

The method of any one of Examples 98-101, wherein the reference set of nucleosome maps is generated by:

purifying cfDNA isolated from the biological sample from control subjects;

constructing a library by adaptor ligation and optionally PCR amplification; and

sequencing the resulting library.

Example 104

The method of any one of Examples 98-101, wherein distribution (a), (b) or (c), or a mathematical transformation of one of these distributions, is subjected to Fourier transformation in contiguous windows, followed by quantitation of intensities for frequency ranges that are associated with nucleosome occupancy, in order to summarize the extent to which nucleosomes exhibit structured positioning within each contiguous window.

Example 105

The method of any one of Examples 98-101, wherein in distribution (a), (b) or (c), or a mathematical transformation of one of these distributions, we quantify the distribution of sites in the reference human genome to which sequencing read start sites map in the immediate vicinity of transcription factor binding sites (TFBS) of a specific transcription factor (TF), which are often immediately flanked by nucleosomes when the TFBS is bound by the TF, in order to summarize nucleosome positioning as a consequence of TF activity in the cell type(s) contributing to cfDNA.

Example 106

The method of any one of Examples 98-101, wherein the nucleosome occupancy signals are summarized in accordance with any one of aggregating signal from distributions (a), (b), and/or (c), or a mathematical transformation of one of these distributions, around other genomic landmarks such as DNase I hypersensitive sites, transcription start sites, topological domains, other epigenetic marks or subsets of all such sites defined by correlated behavior in other datasets (e.g. gene expression, etc.).

Example 107

The method of any one of Examples 98-101, wherein the distributions are transformed in order to aggregate or summarize the periodic signal of nucleosome positioning within various subsets of the genome, e.g. quantifying periodicity in contiguous windows or, alternatively, in discontiguous subsets of the genome defined by transcription factor binding sites, gene model features (e.g. transcription start sites), tissue expression data or other correlates of nucleosome positioning.

Example 108

The method of any one of Examples 98-101, wherein the distributions are defined by tissue-specific data, i.e. aggregate signal in the vicinity of tissue-specific DNase I hypersensitive sites.

Example 109

The method of any one of Examples 98-101, further comprising a step of statistical signal processing for comparing additional nucleosome map(s) to the reference set.

Example 110

The method of Example 109, wherein we first summarize long-range nucleosome ordering within contiguous windows along the genome in a diverse set of samples, and then perform principal components analysis (PCA) to cluster samples or to estimate mixture proportions.

Example 111

The method of Example 100 or Example 101, wherein the clinical condition is cancer, i.e. malignancies.

Example 112

The method of Example 111, wherein the biological sample is circulating plasma containing cfDNA, some portion of which is derived from a tumor.

Example 113

The method of Example 100 or Example 101, wherein the clinical condition is selected from tissue damage, myocardial infarction (acute damage of heart tissue), autoimmune disease (chronic damage of diverse tissues), pregnancy, chromosomal aberrations (e.g. trisomies), and transplant rejection.

Example 114

The method of any preceding Example further comprising assigning a proportion to each of the one or more tissues or cell types determined to be contributing to cfDNA.

Example 115

The method of Example 114 wherein the proportion assigned to each of the one or more determined tissues or cell types is based at least in part on a degree of correlation or of increased correlation, relative to cfDNA from a healthy subject or subjects.

Example 116

The method of Example 114 or Example 115, wherein the degree of correlation is based at least in part on a comparison of a mathematical transformation of the distribution of cfDNA fragment endpoints from the biological sample with the reference map associated with the determined tissue or cell type.

Example 117

The method of any one of Examples 114 to 116, wherein the proportion assigned to each of the one or more determined tissues or cell types is based on a mixture model.
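By way of illustration only, the simplest mixture model has two components: the observed profile is modeled as a proportion p of one reference profile plus (1 − p) of another, and p is chosen to minimize squared error. The sketch below gives the closed-form least-squares estimate, clipped to the range 0 to 1. The function name and the representation of profiles as equal-length lists of numbers are assumptions; the disclosure does not specify a particular mixture model, and more components or a different fitting criterion could equally be used.

```python
def mixture_proportion(observed, ref_a, ref_b):
    """Least-squares p such that observed ≈ p * ref_a + (1 - p) * ref_b.

    All three inputs are equal-length numeric sequences (hypothetical
    per-window or per-coordinate profiles). The result is clipped to [0, 1].
    """
    if not (len(observed) == len(ref_a) == len(ref_b)):
        raise ValueError("profiles must have equal length")
    num = sum((o - b) * (a - b) for o, a, b in zip(observed, ref_a, ref_b))
    den = sum((a - b) ** 2 for a, b in zip(ref_a, ref_b))
    if den == 0:
        raise ValueError("reference profiles are identical")
    return min(1.0, max(0.0, num / den))
```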

From the foregoing, it will be appreciated that specific embodiments of the invention have been described herein for purposes of illustration, but that various modifications may be made without deviating from the scope of the invention. Accordingly, the invention is not limited except as by the appended claims.

1. A method of determining tissues and/or cell types giving rise to cellfree DNA (cfDNA) in a subject, the method comprising: isolating cfDNAfrom a biological sample from the subject, the isolated cfDNA comprisinga plurality of cfDNA fragments; determining a sequence associated withat least a portion of the plurality of cfDNA fragments; determining agenomic location within a reference genome for at least some cfDNAfragment endpoints of the plurality of cfDNA fragments as a function ofthe cfDNA fragment sequences; and determining at least some of thetissues and/or cell types giving rise to the cfDNA fragments as afunction of the genomic locations of at least some of the cfDNA fragmentendpoints.
 2. The method of claim 1 wherein the step of determining atleast some of the tissues and/or cell types giving rise to the cfDNAfragments comprises comparing the genomic locations of at least some ofthe cfDNA fragment endpoints to one or more reference maps.
 3. Themethod of claim 1 or claim 2 wherein the step of determining at leastsome of the tissues and/or cell types giving rise to the cfDNA fragmentscomprises performing a mathematical transformation on a distribution ofthe genomic locations of at least some of the cfDNA fragment endpoints.4. The method of claim 3 wherein the mathematical transformationincludes a Fourier transformation.
 5. The method of any one of claims 1to 4 further comprising determining a score for each of at least somecoordinates of the reference genome, wherein the score is determined asa function of at least the plurality of cfDNA fragment endpoints andtheir genomic locations, and wherein the step of determining at leastsome of the tissues and/or cell types giving rise to the observed cfDNAfragments comprises comparing the scores to one or more reference map.6. The method of claim 5, wherein the score for a coordinate representsor is related to the probability that the coordinate is a location of acfDNA fragment endpoint.
 7. The method of any one of claims 2 to 6wherein the reference map comprises a DNase I hypersensitive sitedataset generated from at least one cell-type or tissue; or wherein thereference map comprises an RNA expression dataset generated from atleast one cell-type or tissue; or wherein the reference map comprises achromosome conformation map generated from at least one cell-type ortissue; or wherein the reference map comprises a chromatin accessibilitymap generated from at least one cell-type or tissue.
 8. (canceled)
 9. The method of any one of claims 2 to 7 wherein the reference map is generated from cfDNA from an animal to which human tissues or cells have been xenografted.
 10-11. (canceled)
 12. The method of any one of claims 2 to 7 and 9 wherein the reference map comprises sequence data obtained from samples obtained from at least one reference subject.
 13. The method of any one of claims 2 to 7, 9 and 12 wherein the reference map corresponds to at least one cell-type or tissue that is associated with a disease or a disorder or condition.
 14. The method of any one of claims 2 to 7, 9, and 12 to 13 wherein the reference map comprises positions or spacing of nucleosomes and/or chromatosomes in a tissue or cell type.
 15. The method of any one of claims 2 to 7, 9, and 12 to 14 wherein the reference map is generated by digesting chromatin obtained from at least one cell-type or tissue with an exogenous nuclease (e.g., micrococcal nuclease).
 16. The method of any one of claims 2 to 7, 9 and 12 to 15, wherein the reference maps comprise chromatin accessibility data determined by a transposition-based method (e.g., ATAC-seq).
 17. The method of any one of claims 2 to 7, 9 and 12 to 16 wherein the reference maps comprise data associated with positions of a DNA binding and/or DNA occupying protein for a tissue or cell type.
 18. The method of claim 17 wherein the DNA binding and/or DNA occupying protein is a transcription factor.
 19. The method of claim 17 or claim 18 wherein the positions are determined by chromatin immunoprecipitation of a crosslinked DNA-protein complex.
 20. The method of claim 17 or claim 18 wherein the positions are determined by treating DNA associated with the tissue or cell type with a nuclease (e.g., DNase I).
 21. The method of any one of claims 2 to 7, 9 and 12 to 20 wherein the reference map comprises a biological feature related to the positions or spacing of nucleosomes, chromatosomes, or other DNA binding or DNA occupying proteins within a tissue or cell type.
 22. The method of claim 21 wherein the biological feature is quantitative expression of one or more genes; or wherein the biological feature is presence or absence of one or more histone marks; or wherein the biological feature is hypersensitivity to nuclease cleavage.
 23-24. (canceled)
 25. The method of any one of claims 2 to 7, 9, and 12 to 22 wherein the tissue or cell type used to generate a reference map is a primary tissue from a subject having a disease or disorder or condition.
 26. The method of claim 25 wherein the disease or disorder or condition is selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, inflammatory bowel disease, systemic autoimmune disease, localized autoimmune disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.
 27. The method of any one of claims 2 to 7, 9 and 12 to 22 wherein the tissue or cell type used to generate a reference map is a primary tissue from a healthy subject; or wherein the tissue or cell type used to generate a reference map is an immortalized cell line; or wherein the tissue or cell type used to generate a reference map is a biopsy from a tumor.
 28-29. (canceled)
 30. The method of claim 12 wherein the sequence data comprises positions of cfDNA fragment endpoints.
 31. The method of claim 30 wherein the reference subject is healthy; or wherein the reference subject has a disease or disorder or condition.
 32. (canceled)
 33. The method of claim 31 wherein the disease or disorder or condition is selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, inflammatory bowel disease, systemic autoimmune disease, localized autoimmune disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.
 34. The method of any one of claims 13 to 22, 25 to 27, 30 to 31 and 33 wherein the reference map comprises reference scores for at least a portion of coordinates of the reference genome associated with the tissue or cell type.
 35. The method of claim 34 wherein the reference map comprises a mathematical transformation of the scores.
 36. The method of claim 34 wherein the scores represent a subset of all reference genomic coordinates associated with the tissue or cell type.
 37. The method of claim 36 wherein the subset is associated with positions or spacing of nucleosomes and/or chromatosomes; or wherein the subset is associated with transcription start sites and/or transcription end sites; or wherein the subset is associated with binding sites of at least one transcription factor; or wherein the subset is associated with nuclease hypersensitive sites.
 38-40. (canceled)
 41. The method of any one of claims 36 to 37 wherein the subset is additionally associated with at least one orthogonal biological feature.
 42. The method of claim 41 wherein the orthogonal biological feature is associated with high expression genes; or wherein the orthogonal biological feature is associated with low expression genes.
 43. (canceled)
 44. The method of any one of claims 35 to 37 and 41 to 42 wherein the mathematical transformation includes a Fourier transformation.
 45. The method of any one of claims 5 to 7, 9, 12 to 22, 25 to 27, 30 to 31, 33 to 37, 41 to 42, and 44 wherein at least a subset of the plurality of the scores each has a score above a threshold value.
 46. The method of any one of claims 1 to 7, 9, 12 to 22, 25 to 27, 30 to 31, 33 to 37, 41 to 42, and 44 to 45 wherein the step of determining the tissues and/or cell types giving rise to the cfDNA as a function of a plurality of the genomic locations of at least some of the cfDNA fragment endpoints comprises comparing a Fourier transform of the plurality of the genomic locations of at least some of the cfDNA fragment endpoints, or a mathematical transformation thereof, with a reference map.
 47. The method of any one of claims 1 to 7, 9, 12 to 22, 25 to 27, 30, 31, 33 to 37, 41, 42, and 44 to 46 further comprising generating a report comprising a list of the determined tissues and/or cell types giving rise to the isolated cfDNA.
 48. A method of identifying or diagnosing a disease or disorder or condition in a subject, the method comprising: isolating cell-free DNA (cfDNA) from a biological sample from the subject, the isolated cfDNA comprising a plurality of cfDNA fragments; determining a sequence associated with at least a portion of the plurality of cfDNA fragments; determining a genomic location within a reference genome for at least some cfDNA fragment endpoints of the plurality of cfDNA fragments as a function of the cfDNA fragment sequences; optionally determining at least some of the tissues and/or cell types giving rise to the cfDNA as a function of the genomic locations of at least some of the cfDNA fragment endpoints; and identifying or diagnosing the disease or disorder or condition as a function of the determined tissues and/or cell types giving rise to the cfDNA.
 49. The method of claim 48 wherein the optional step of determining the tissues and/or cell types giving rise to the cfDNA comprises comparing the genomic locations of at least some of the cfDNA fragment endpoints to one or more reference maps.
 50. The method of claim 48 or claim 49 wherein the step of determining the tissues and/or cell types giving rise to the cfDNA comprises performing a mathematical transformation on a distribution of the genomic locations of at least some of the plurality of the cfDNA fragment endpoints.
 51. The method of claim 50 wherein the mathematical transformation includes a Fourier transformation.
 52. The method of any one of claims 48 to 51 further comprising determining a score for each of at least some coordinates of the reference genome, wherein the score is determined as a function of at least the plurality of cfDNA fragment endpoints and their genomic locations.
 53. The method of any one of claims 52, 112 and 113, wherein the score for a coordinate represents or is related to the probability that the coordinate is a location of a cfDNA fragment endpoint.
 54. The method of any one of claims 49 to 53 and 112 to 113 wherein the reference map comprises a DNase I hypersensitive site dataset, an RNA expression dataset, expression data, a chromosome conformation map, a chromatin accessibility map, a chromatin fragmentation map, or sequence data obtained from samples obtained from at least one reference subject, and corresponding to at least one cell type or tissue that is associated with a disease or a disorder or condition, and/or positions or spacing of nucleosomes and/or chromatosomes in a tissue or cell type.
 55. The method of any one of claims 49 to 54 and 112 to 114 wherein the reference map is generated by digesting chromatin from at least one cell-type or tissue with an exogenous nuclease (e.g., micrococcal nuclease); or wherein the reference map comprises chromatin accessibility data determined by applying a transposition-based method (e.g., ATAC-seq) to nuclei or chromatin from at least one cell-type or tissue.
 56. (canceled)
 57. The method of any one of claims 49 to 55 and 112 to 114 wherein the reference maps comprise data associated with positions of a DNA binding and/or DNA occupying protein for a tissue or cell type.
 58. The method of claim 57 wherein the DNA binding and/or DNA occupying protein is a transcription factor.
 59. The method of claim 57 or claim 58 wherein the positions are determined by applying chromatin immunoprecipitation of a crosslinked DNA-protein complex to at least one cell-type or tissue.
 60. The method of claim 57 or claim 58 wherein the positions are determined by treating DNA associated with the tissue or cell type with a nuclease (e.g., DNase I).
 61. The method of any one of claims 48 to 55, 57 to 60, and 112 to 114 wherein the reference map comprises a biological feature related to the positions or spacing of nucleosomes, chromatosomes, or other DNA binding or DNA occupying proteins within a tissue or cell type.
 62. The method of claim 61 wherein the biological feature is quantitative expression of one or more genes; or wherein the biological feature is presence or absence of one or more histone marks; or wherein the biological feature is hypersensitivity to nuclease cleavage.
 63-64. (canceled)
 65. The method of any one of claims 49 to 55, 57 to 62 and 112 to 114 wherein the tissue or cell type used to generate a reference map is a primary tissue from a subject having a disease or disorder or condition.
 66. The method of claim 65 wherein the disease or disorder or condition is selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, inflammatory bowel disease, systemic autoimmune disease, localized autoimmune disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.
 67. The method of any one of claims 49 to 55, 57 to 62, 65, and 112 to 114 wherein the tissue or cell type used to generate a reference map is a primary tissue from a healthy subject; or wherein the tissue or cell type used to generate a reference map is an immortalized cell line; or wherein the tissue or cell type used to generate a reference map is a biopsy from a tumor.
 68-69. (canceled)
 70. The method of claim 54 wherein the sequence data obtained from samples obtained from at least one reference subject comprises positions of cfDNA fragment endpoints.
 71. The method of claim 70 or 114 wherein at least one of the reference subjects is healthy.
 72. The method of claim 70 or 114 wherein at least one of the reference subjects has a disease or disorder or condition.
 73. The method of claim 72 wherein the disease or disorder or condition is selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, inflammatory bowel disease, systemic autoimmune disease, localized autoimmune disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.
 74. The method of any one of claims 54 to 55, 57 to 62, 65 to 67, 70 to 73, and 114 wherein the reference map comprises cfDNA fragment endpoint probabilities for at least a portion of the reference genome associated with the tissue or cell type.
 75. The method of claim 74 wherein the reference map comprises a mathematical transformation of the cfDNA fragment endpoint probabilities.
 76. The method of claim 74 wherein the cfDNA fragment endpoint probabilities represent a subset of all reference genomic coordinates for the tissue or cell type.
 77. The method of claim 76 wherein the subset is associated with positions or spacing of nucleosomes and/or chromatosomes; or wherein the subset is associated with transcription start sites and/or transcription end sites; or wherein the subset is associated with binding sites of at least one transcription factor; or wherein the subset is associated with nuclease hypersensitive sites.
 78-80. (canceled)
 81. The method of claim 76 or claim 77 wherein the subset is additionally associated with at least one orthogonal biological feature.
 82. The method of claim 81 wherein the orthogonal biological feature is associated with high expression genes; or wherein the orthogonal biological feature is associated with low expression genes.
 83. (canceled)
 84. The method of any one of claims 75 to 77 and 81 to 82 wherein the mathematical transformation includes a Fourier transformation.
 85. The method of any one of claims 52 to 55, 57 to 62, 65 to 67, 70 to 77, 81 to 82, 84, and 112 to 114 wherein at least a subset of the plurality of the cfDNA fragment endpoint scores each has a score above a threshold value.
 86. The method of any one of claims 48 to 55, 57 to 62, 65 to 67, 70 to 77, 81 to 82, 84 to 85, and 112 to 114 wherein the step of determining the tissue(s) and/or cell type(s) giving rise to the cfDNA as a function of a plurality of the genomic locations of at least some of the cfDNA fragment endpoints comprises comparing a Fourier transform of the plurality of the genomic locations of at least some of the cfDNA fragment endpoints, or a mathematical transformation thereof, with a reference map.
 87. The method of any one of claims 48 to 55, 57 to 62, 65 to 67, 70 to 77, 81 to 82, 84 to 86, and 112 to 114 wherein the reference map comprises DNA or chromatin fragmentation data corresponding to at least one tissue that is associated with the disease or disorder or condition.
 88. The method of any one of claims 48 to 55, 57 to 62, 65 to 67, 70 to 77, 81 to 82, 84 to 87, and 112 to 114 wherein the reference genome is associated with a human.
 89. The method of any one of claims 48 to 55, 57 to 62, 65 to 67, 70 to 77, 81 to 82, 84 to 88, and 112 to 114 further comprising generating a report comprising a statement identifying or diagnosing the disease or disorder or condition.
 90. The method of claim 89 wherein the report further comprises a list of the determined tissue(s) and/or cell type(s) giving rise to the isolated cfDNA.
 91. The method of any one of claims 1 to 7, 9, 12 to 22, 25 to 27, 30, 31, 33 to 37, 41, 42, 44 to 55, 57 to 62, 65 to 67, 70 to 82, and 84 to 90 wherein the biological sample comprises, consists essentially of, or consists of whole blood, peripheral blood plasma, urine, or cerebral spinal fluid.
 92. The method for determining tissues and/or cell types giving rise to cell-free DNA (cfDNA) in a subject of claim 1, comprising: (i) generating a fragment endpoint map by obtaining a biological sample from the subject, isolating the cfDNA from the biological sample, and measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; (ii) generating a reference set of fragment endpoint maps by obtaining a biological sample from control subjects or subjects with known disease or disorder or condition, isolating the cfDNA from the biological sample, measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; and (iii) determining the tissues and/or cell types giving rise to the cfDNA by comparing the fragment endpoint map derived from the cfDNA to the reference set of fragment endpoint maps; wherein (a), (b) and (c) are: (a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a cfDNA fragment; (b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a cfDNA fragment; and (c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a cfDNA fragment as a consequence of differential nucleosome occupancy.
 93. The method for determining tissues and/or cell types giving rise to cell-free DNA in a subject of claim 1, comprising: (i) generating a fragment endpoint map by obtaining a biological sample from the subject, isolating the cfDNA from the biological sample, and measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; (ii) generating a reference set of fragment endpoint maps by obtaining a biological sample from control subjects or subjects with known disease or disorder or condition, isolating chromatin from the biological sample, measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of DNA derived from digestion of chromatin; and (iii) determining the tissues and/or cell types giving rise to the cfDNA by comparing the fragment endpoint map derived from the cfDNA to the reference set of nucleosome maps; wherein (a), (b) and (c) are: (a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a sequenced fragment; (b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a sequenced fragment; and (c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a sequenced fragment as a consequence of differential nucleosome occupancy.
 94. The method for identifying or diagnosing a disease or disorder or condition in a subject of claim 48, comprising: (i) generating a fragment endpoint map by obtaining a biological sample from the subject, isolating cfDNA from the biological sample, and measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; (ii) generating a reference set of fragment endpoint maps by obtaining a biological sample from control subjects or subjects with known disease or disorder or condition, isolating the cfDNA from the biological sample, measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; and (iii) determining the clinical condition by comparing the fragment endpoint map derived from the cfDNA to the reference set of fragment endpoint maps; wherein (a), (b) and (c) are: (a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a cfDNA fragment; (b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a cfDNA fragment; and (c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a cfDNA fragment as a consequence of differential nucleosome occupancy.
 95. The method for identifying or diagnosing a disease or disorder or condition in a subject, comprising: (i) generating a fragment endpoint map by obtaining a biological sample from the subject, isolating cfDNA from the biological sample, and measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of cfDNA; (ii) generating a reference set of nucleosome maps by obtaining a biological sample from control subjects or subjects with known disease or disorder or condition, isolating chromatin from the biological sample, measuring distributions (a), (b) and/or (c) by library construction and massively parallel sequencing of DNA derived from digestion of chromatin; and (iii) determining the tissue-of-origin composition of the cfDNA by comparing the fragment endpoint map derived from the cfDNA to the reference set of nucleosome maps; wherein (a), (b) and (c) are: (a) the distribution of likelihoods that any specific base-pair in a human genome will appear at a terminus of a sequenced fragment; (b) the distribution of likelihoods that any pair of base-pairs of a human genome will appear as a pair of termini of a sequenced fragment; and (c) the distribution of likelihoods that any specific base-pair in a human genome will appear in a sequenced fragment as a consequence of differential nucleosome occupancy.
 96. The method of any one of claims 92-95, wherein the fragment endpoint map of the subject is generated by: purifying the cfDNA isolated from the biological sample; constructing a library by adaptor ligation and optionally PCR amplification; and sequencing at least a portion of the resulting library.
 97. The method of claim 92 or claim 94, wherein at least one of the reference set of fragment endpoint maps is generated by: purifying cfDNA isolated from the biological sample from control subjects; constructing a library by adaptor ligation and optionally PCR amplification; and sequencing at least a portion of the resulting library.
 98. The method of any one of claims 92-95, wherein distribution (a), (b) or (c), or a mathematical transformation of one of these distributions, is subjected to Fourier transformation in contiguous windows, followed by quantitation of intensities for frequency ranges that are associated with nucleosome occupancy, in order to summarize the extent to which nucleosomes exhibit structured positioning within each contiguous window.
 99. The method of any one of claims 92-95, where distribution (a), (b) or (c), or a mathematical transformation of one of these distributions, is calculated for a subset of the genome.
 100. The method of claim 99, wherein the subset comprises coordinates, within the reference genome, that are associated with the binding of at least one transcription factor in at least one cell type or tissue, or are associated with DNase I hypersensitive sites, or are associated with transcription start sites, or are associated with topologically associated domains.
 101. (canceled)
 102. The method of claim 99, wherein the subset is defined by tissue-specific data, e.g., tissue-specific DNase I hypersensitivity.
 103. The method of any one of claims 92-95, further comprising a step of statistical signal processing for comparing additional nucleosome map(s) to the reference set.
 104. The method of any one of claims 92-95, in which the comparison in (iii) comprises clustering by principal components analysis (PCA) or by hierarchical clustering.
 105. The method of claim 94 or claim 95 wherein the disease or disorder or condition is selected from the group consisting of: cancer, normal pregnancy, a complication of pregnancy (e.g., aneuploid pregnancy), myocardial infarction, inflammatory bowel disease, systemic autoimmune disease, localized autoimmune disease, allotransplantation with rejection, allotransplantation without rejection, stroke, and localized tissue damage.
 106. The method of claim 105, wherein the biological sample comprises, consists essentially of, or consists of whole blood, peripheral blood plasma, urine, or cerebral spinal fluid.
 107. (canceled)
 108. The method of any one of claims 1 to 7, 9, 12 to 22, 25 to 27, 30, 31, 33 to 37, 41, 42, 44 to 55, 57 to 62, 65 to 67, 70 to 82, 84 to 100, 102 to 106, and 112 to 114 further comprising assigning a proportion to each of the one or more tissues or cell types determined to be contributing to cfDNA.
 109. The method of claim 108 wherein the proportion assigned to each of the one or more determined tissues or cell types is based at least in part on the absolute magnitude of correlation, or on the change in correlation relative to cfDNA from a healthy subject or subjects.
 110. The method of claim 108 or claim 109, wherein the correlation is based at least in part on a comparison of a mathematical transformation of the distribution of cfDNA fragment endpoints from the biological sample with the reference map associated with the determined tissue or cell type.
 111. The method of any one of claims 108 to 110, wherein the proportion assigned to each of the one or more determined tissues or cell types is based on a mixture model.
 112. The method of any one of claims 48 to 52 wherein the step of identifying or diagnosing the disease or disorder or condition comprises comparing the scores to one or more reference maps.
 113. The method of any one of claims 48 to 52 and 112 wherein the step of determining at least some of the tissues and/or cell types giving rise to the observed cfDNA fragments comprises comparing the scores to one or more reference maps.
 114. The method of any one of claims 49 to 53 and 112 to 113 wherein the reference map comprises sequence data obtained from samples obtained from at least one reference subject.
 115. The method of any one of claims 92-95, where distribution (a), (b), or (c), or a mathematical transformation of one of these distributions, is calculated for a subset of the genome.
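
By way of non-limiting illustration only, the steps recited above (counting cfDNA fragment endpoints per genomic coordinate, applying a Fourier transformation in contiguous windows, and comparing the result with reference maps) might be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of the disclosure: the function names, the 1000 bp window size, the 120 to 280 bp period band, and the use of Pearson correlation for the comparison are all hypothetical choices made for the example.

```python
import numpy as np


def endpoint_counts(fragments, region_start, region_end):
    """Count fragment endpoints at each coordinate of [region_start, region_end).

    fragments: iterable of (start, end) genomic coordinates; both ends are
    treated as endpoint positions (a simplifying assumption of this sketch).
    """
    counts = np.zeros(region_end - region_start, dtype=float)
    for start, end in fragments:
        for pos in (start, end):
            if region_start <= pos < region_end:
                counts[pos - region_start] += 1.0
    return counts


def band_intensity(signal, min_period=120.0, max_period=280.0):
    """Mean FFT magnitude for periods within [min_period, max_period] bp.

    The period band is an assumed stand-in for "frequency ranges that are
    associated with nucleosome occupancy"; it is not taken from the claims.
    """
    signal = np.asarray(signal, dtype=float)
    centered = signal - signal.mean()
    spectrum = np.abs(np.fft.rfft(centered))[1:]      # drop the zero frequency
    freqs = np.fft.rfftfreq(signal.size, d=1.0)[1:]   # cycles per base-pair
    periods = 1.0 / freqs
    mask = (periods >= min_period) & (periods <= max_period)
    return float(spectrum[mask].mean()) if mask.any() else 0.0


def windowed_intensities(signal, window=1000):
    """Band intensity for each contiguous, non-overlapping window."""
    n_windows = len(signal) // window
    return np.array(
        [band_intensity(signal[i * window:(i + 1) * window]) for i in range(n_windows)]
    )


def rank_reference_maps(sample_profile, reference_maps):
    """Rank reference maps by Pearson correlation with the sample profile."""
    scores = {
        name: float(np.corrcoef(sample_profile, ref)[0, 1])
        for name, ref in reference_maps.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In such a sketch, `endpoint_counts` would supply a per-coordinate score, `windowed_intensities` would summarize periodic structure within each window, and `rank_reference_maps` would order hypothetical tissue or cell-type reference profiles by their similarity to the sample profile. Any real comparison would depend on how the reference maps were generated and normalized.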